Jira Data Center / JRASERVER-69652

Asynchronous cache replication can cause extra overhead with a large number of cache updates and many stale nodes


    • 7.09
    • 43
    • Severity 2 - Major
    • 39
      Atlassian Update – 19 May 2023

      Hi everyone,

      We’ve investigated the bug and decided to lower its priority. In making this decision, we considered the number of votes and watches, as well as the lack of recent activity on this ticket.

      Reason:

      While the underlying problem has not been resolved, the chances of it occurring are much lower than they used to be. The described cache inefficiency shows up when there are many stale nodes, and the workaround is to remove them. Since Jira 8.10, stale node removal has been automated thanks to JRASERVER-42916.

      Next steps:

      If this problem still occurs on your instance, don't hesitate to comment on this ticket and/or contact Atlassian Support, so that we can learn the true impact.

      Best regards,

      Kamil Cichy
      Senior Software Engineer, Jira DC


      Summary

      Asynchronous cache replication can cause extra overhead when there is a large number of cache updates and many stale nodes.

      Environment

      • Jira DC
      • A large number of stale nodes (see JRASERVER-42916)
      • A plugin (or other code) generating a large number of cache update events, e.g. 2,000 messages per minute.
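The two conditions above multiply. The stack trace under Actual Results shows each cache update going through LocalQCacheManager.addToAllQueues into a file-backed Tape queue per node. The sketch below is a hypothetical, simplified model (not Jira's code) that assumes, as this issue implies, that a queue is still kept for every node recorded in the cluster, stale or not. The class ReplicationFanOut and the node counts are made up for illustration; only the 2,000 updates/min figure comes from this report.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of per-node asynchronous cache replication.
// It is NOT Jira's LocalQCacheManager; it only shows how the work done for one
// cache update scales with the number of node queues, live or stale.
public class ReplicationFanOut {

    // Stand-in for a per-node, file-backed queue (Jira uses Tape's QueueFile).
    static final class NodeQueue {
        final boolean stale; // node no longer runs, but its queue still exists
        long writes;         // in the real system each add() touches the disk

        NodeQueue(boolean stale) {
            this.stale = stale;
        }

        // Models QueueFile.add() + writeHeader() executed under the queue's lock.
        synchronized void add(String cacheOp) {
            writes++;
        }
    }

    final List<NodeQueue> queues = new ArrayList<>();

    // Each local cache update is appended to every known node's queue.
    void replicateToAllQueues(String cacheOp) {
        for (NodeQueue q : queues) {
            q.add(cacheOp);
        }
    }

    // Total queue writes, optionally counting only queues of stale nodes.
    long writes(boolean staleOnly) {
        long sum = 0;
        for (NodeQueue q : queues) {
            if (!staleOnly || q.stale) {
                sum += q.writes;
            }
        }
        return sum;
    }

    // Builds a cluster view with the given number of other live and stale nodes
    // and replays `updates` cache updates through it.
    static ReplicationFanOut simulate(int liveOthers, int stale, int updates) {
        ReplicationFanOut r = new ReplicationFanOut();
        for (int i = 0; i < liveOthers; i++) r.queues.add(new NodeQueue(false));
        for (int i = 0; i < stale; i++) r.queues.add(new NodeQueue(true));
        for (int i = 0; i < updates; i++) r.replicateToAllQueues("put:key-" + i);
        return r;
    }

    public static void main(String[] args) {
        // Hypothetical cluster: 3 other live nodes, 20 stale nodes,
        // 2000 updates in one minute (the rate cited in this issue).
        ReplicationFanOut r = simulate(3, 20, 2000);
        System.out.println("queue writes per minute: " + r.writes(false)); // 46000
        System.out.println("spent on stale nodes:    " + r.writes(true));  // 40000
    }
}
```

In this model, dropping the 20 stale queues cuts the work from 46,000 to 6,000 queue writes per minute, which is the effect the workaround relies on.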

      Steps to Reproduce

      1. Open a URL whose request handling triggers cache update events, e.g.:
        • /rest/servicedesk/1/<PRJ>/webfragments/sections/sd-queues-nav,servicedesk.agent.queues,servicedesk.agent.queues.ungrouped
      2. Measure the response time and the number of replication events.
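For the response-time half of step 2, a minimal client-side sketch (Java 11+ java.net.http) that requests the same endpoint repeatedly and reports the median latency. The class name ReproTimer, the run count of 20, and passing the Authorization header value on the command line are assumptions for illustration; counting replication events is not covered here.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ReproTimer {

    // Median of the collected latencies (upper median for even sizes).
    static long medianMillis(List<Long> samples) {
        List<Long> sorted = new ArrayList<>(samples);
        Collections.sort(sorted);
        return sorted.get(sorted.size() / 2);
    }

    // Sends the same request `runs` times and records each wall-clock latency in ms.
    static List<Long> timeRequests(HttpClient client, HttpRequest request, int runs)
            throws Exception {
        List<Long> samples = new ArrayList<>();
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            client.send(request, HttpResponse.BodyHandlers.discarding());
            samples.add((System.nanoTime() - start) / 1_000_000);
        }
        return samples;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: ReproTimer <endpoint-url> <authorization-header-value>");
            return;
        }
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10))
                .build();
        // args[1] is whatever Authorization value the instance accepts (assumption).
        HttpRequest request = HttpRequest.newBuilder(URI.create(args[0]))
                .header("Authorization", args[1])
                .GET()
                .build();
        System.out.println("median ms: " + medianMillis(timeRequests(client, request, 20)));
    }
}
```

Running it before and after removing stale nodes gives a rough comparison under otherwise similar load.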

      Expected Results

      Performance doesn't degrade as the number of old (stale) nodes grows.

      Actual Results

      Performance degrades as the number of stale nodes grows.

      • Thread dumps show many threads busy in the following stack:
          java.lang.Thread.State: RUNNABLE
        	at java.io.RandomAccessFile.writeBytes(Native Method)
        	at java.io.RandomAccessFile.write(RandomAccessFile.java:512)
        	at com.squareup.tape.QueueFile.writeHeader(QueueFile.java:184)
        	at com.squareup.tape.QueueFile.add(QueueFile.java:321)
        	- locked <0x00000003ce9b43e0> (a com.squareup.tape.QueueFile)
        	at com.squareup.tape.FileObjectQueue.add(FileObjectQueue.java:46)
        	at com.atlassian.jira.cluster.distribution.localq.tape.TapeLocalQCacheOpQueue.add(TapeLocalQCacheOpQueue.java:151)
        	at com.atlassian.jira.cluster.distribution.localq.LocalQCacheOpQueueWithStats.add(LocalQCacheOpQueueWithStats.java:115)
        	at com.atlassian.jira.cluster.distribution.localq.LocalQCacheManager.addToQueue(LocalQCacheManager.java:370)
        	at com.atlassian.jira.cluster.distribution.localq.LocalQCacheManager.addToAllQueues(LocalQCacheManager.java:354)
        	at com.atlassian.jira.cluster.distribution.localq.LocalQCacheReplicator.replicateToQueue(LocalQCacheReplicator.java:85)
        	at com.atlassian.jira.cluster.distribution.localq.LocalQCacheReplicator.replicatePutNotification(LocalQCacheReplicator.java:65)
        	at com.atlassian.jira.cluster.cache.ehcache.AbstractJiraCacheReplicator.notifyElementUpdated(AbstractJiraCacheReplicator.java:123)
        	at net.sf.ehcache.event.RegisteredEventListeners.internalNotifyElementUpdated(RegisteredEventListeners.java:228)
        	at net.sf.ehcache.event.RegisteredEventListeners.notifyElementUpdated(RegisteredEventListeners.java:206)
        ...
        
      • In one customer case, we saw 15-20% of all threads in replicateToQueue.
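A share like the 15-20% above can be estimated from a saved thread dump by counting thread sections and those whose stack contains the replicateToQueue frame. This sketch assumes the usual jstack layout, where each thread section begins with a line starting with the quoted thread name and frames follow on indented "at ..." lines; ThreadDumpScan is a made-up name, and dumps mixed with other log output may need pre-filtering.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ThreadDumpScan {

    // Totals for one dump: all threads, and threads whose stack contains the frame.
    record Counts(int threads, int matching) {
        double percent() { return threads == 0 ? 0.0 : 100.0 * matching / threads; }
    }

    // Each thread section starts with a line beginning with '"' (the thread name).
    // A thread matches if any line of its section contains `frame`.
    static Counts scan(List<String> lines, String frame) {
        int threads = 0, matching = 0;
        boolean currentMatched = false;
        for (String line : lines) {
            if (line.startsWith("\"")) {          // a new thread section begins
                if (currentMatched) matching++;
                threads++;
                currentMatched = false;
            } else if (threads > 0 && line.contains(frame)) {
                currentMatched = true;
            }
        }
        if (currentMatched) matching++;           // close the last section
        return new Counts(threads, matching);
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to a thread dump captured with jstack
        Counts c = scan(Files.readAllLines(Path.of(args[0])),
                "LocalQCacheReplicator.replicateToQueue");
        System.out.printf("%d of %d threads (%.1f%%) in replicateToQueue%n",
                c.matching(), c.threads(), c.percent());
    }
}
```

Sampling several dumps a few seconds apart gives a steadier estimate than a single dump.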

      Notes

      None

      Workaround

      Clean up old node data manually; see JRASERVER-42916.
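Before cleaning up, the stale nodes have to be identified. A minimal sketch of that selection step, assuming a hypothetical ClusterNode record (node id plus last heartbeat time) and a hypothetical two-day threshold; it is not Jira's schema or its automated cleanup logic, and the actual removal procedure is the one referenced in JRASERVER-42916.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

public class StaleNodeFilter {

    // Hypothetical view of one cluster node entry; not Jira's actual schema.
    record ClusterNode(String nodeId, Instant lastHeartbeat) {}

    // A node is treated as stale if it has not sent a heartbeat within maxSilence.
    static List<ClusterNode> staleNodes(List<ClusterNode> nodes, Instant now, Duration maxSilence) {
        Instant cutoff = now.minus(maxSilence);
        return nodes.stream()
                .filter(n -> n.lastHeartbeat().isBefore(cutoff))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2023-05-19T12:00:00Z");
        List<ClusterNode> nodes = List.of(
                new ClusterNode("node1", now.minus(Duration.ofSeconds(30))),
                new ClusterNode("node2", now.minus(Duration.ofDays(40))),
                new ClusterNode("node3", now.minus(Duration.ofDays(3))));
        // Hypothetical threshold: two days without a heartbeat.
        for (ClusterNode n : staleNodes(nodes, now, Duration.ofDays(2))) {
            System.out.println("candidate for cleanup: " + n.nodeId());
        }
    }
}
```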

      Assignee: Unassigned
      Reporter: Andriy Yakovlev [Atlassian] (ayakovlev@atlassian.com)
      Votes: 6
      Watchers: 19