Jira Data Center / JRASERVER-72856

Large Localq file sizes can result in Cache Replication errors from Tape


Details

    • 7.13
    • Severity 2 - Major
      We have strong evidence that JRASERVER-73252 might have been the root cause of this issue. Note that JRASERVER-73252 was triggered not only by a DB restart, but could also result from any temporary networking problems between the Jira nodes and the DB.


    Description

      Issue Summary

      At times the cache replication queue files can become quite large, resulting in cache replication errors. This has happened across a number of support tickets and the root cause is still not known; a common theme is the large size of the localq files.

       

      Steps to Reproduce

      n/a

      Expected Results

      Cache replication continues to work without errors

      Actual Results

      Cache replication fails with errors like the following in the logs:

      2021-09-20 19:03:01,157 http-nio-8080-exec-84 ERROR Jira 1303x259x4 11g2toz 10.12.200.110,10.12.200.110,127.0.0.1 /rest/api/2/issue/TEST-13214/remotelink [c.a.j.c.distribution.localq.LocalQCacheManager] [LOCALQ] Critical state of local cache replication queue - cannot add: LocalQCacheOp{cacheName='com.atlassian.jira.cluster.dbr.DBRMessage', action=PUT, key=DBR, value == null ? false, replicatePutsViaCopy=true, creationTimeInMillis=1632164581156} to queue: [queueId=queue_prodnode2_1_9c79a1ab55dcc55cc4a830cfc917b8a9_put, queuePath=/var/atlassian/jira/application-data/localq/queue_prodnode2_1_9c79a1ab55dcc55cc4a830cfc917b8a9_put], error: Failed to add entry.
      com.squareup.tape.FileException: Failed to add entry.
      	at com.squareup.tape.FileObjectQueue.add(FileObjectQueue.java:61)
      	at com.atlassian.jira.cluster.distribution.localq.tape.TapeLocalQCacheOpQueue.add(TapeLocalQCacheOpQueue.java:167)
      	at com.atlassian.jira.cluster.distribution.localq.LocalQCacheOpQueueWithStats.add(LocalQCacheOpQueueWithStats.java:121)
      	at com.atlassian.jira.cluster.distribution.localq.LocalQCacheManager.addToQueue(LocalQCacheManager.java:401)
      	at com.atlassian.jira.cluster.distribution.localq.LocalQCacheManager.addToAllQueues(LocalQCacheManager.java:382)
      	at com.atlassian.jira.cluster.distribution.localq.LocalQCacheReplicator.replicateToQueue(LocalQCacheReplicator.java:87)
      	at com.atlassian.jira.cluster.distribution.localq.LocalQCacheReplicator.replicatePutNotification(LocalQCacheReplicator.java:67)
      	at com.atlassian.jira.cluster.cache.ehcache.AbstractJiraCacheReplicator.notifyElementUpdated(AbstractJiraCacheReplicator.java:123)
      	at net.sf.ehcache.event.RegisteredEventListeners.internalNotifyElementUpdated(RegisteredEventListeners.java:228)
      	at net.sf.ehcache.event.RegisteredEventListeners.notifyElementUpdated(RegisteredEventListeners.java:206)
      	at net.sf.ehcache.Cache.notifyPutInternalListeners(Cache.java:1646)
      	at net.sf.ehcache.Cache.putInternal(Cache.java:1618)
      	at net.sf.ehcache.Cache.put(Cache.java:1543)
      	at net.sf.ehcache.Cache.put(Cache.java:1508)
      	at com.atlassian.cache.ehcache.DelegatingCache.put(DelegatingCache.java:93)
      	at com.atlassian.jira.cache.DeferredReplicationCache.lambda$put$0(DeferredReplicationCache.java:60)
      	at com.atlassian.jira.cluster.cache.ehcache.BlockingParallelCacheReplicator.runDeferred(BlockingParallelCacheReplicator.java:172)
      	at com.atlassian.jira.cache.DeferredReplicationCache.put(DeferredReplicationCache.java:59)
      	at com.atlassian.jira.cluster.dbr.transport.DefaultDBRTransportManager.send(DefaultDBRTransportManager.java:45)
      	at com.atlassian.jira.cluster.dbr.DefaultDBRSender.sendDBRMessage(DefaultDBRSender.java:135)
      	at com.atlassian.jira.cluster.dbr.DefaultDBRSender.sendUpdateWithRelated(DefaultDBRSender.java:90)
      	at com.atlassian.jira.cluster.dbr.AsyncDBRSender.lambda$sendUpdateWithRelated$0(AsyncDBRSender.java:68)
      	at co.elastic.apm.agent.concurrent.JavaConcurrent$RunnableLambdaWrapper.run(JavaConcurrent.java:220)
      	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      	at java.base/java.lang.Thread.run(Thread.java:834)
      Caused by: java.io.IOException: Invalid argument
      	at java.base/java.io.RandomAccessFile.setLength(Native Method)
      	at com.squareup.tape.QueueFile.setLength(QueueFile.java:406)
      	at com.squareup.tape.QueueFile.expandIfNecessary(QueueFile.java:375)
      	at com.squareup.tape.QueueFile.add(QueueFile.java:305)
      	at com.squareup.tape.FileObjectQueue.add(FileObjectQueue.java:58)
      

      Workaround

      In several reports, deleting the contents of the localq directory on each node and restarting Jira seems to resolve the errors (a minimal shell sketch follows the example file names below).

      The localq files are found in the following path:

      • <local-home-directory>/localq

      Each file will be named similarly to the ones below:

      queue_Node2_0_f5f366263dcc357e2720042f33286f8f
      queue_Node2_9_f5f366263dcc357e2720042f33286f8f
      queue_Node3_0_bbb747b1516b5225f1ec1c65887b39fc
      queue_Node3_9_bbb747b1516b5225f1ec1c65887b39fc  
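      A minimal shell sketch of this workaround, run on each node in turn. The local home path and the service commands below are assumptions and must be adapted to your installation; Jira has to be stopped on the node before the queue files are removed.

      JIRA_LOCAL_HOME=/var/atlassian/application-data/jira   # placeholder - use this node's local home directory
      sudo systemctl stop jira                                # or your installation's stop script
      rm -f "$JIRA_LOCAL_HOME"/localq/*                       # delete the replication queue files
      sudo systemctl start jira                               # start the node again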

      Investigation

      In general, the localq queue files should be empty most of the time. When they are overflowing, there is already another problem in the system. Use the localq stats to investigate which process is the main source of cache replication events.

      1. In the error message, find which type of queue is overflowing: via copy or via invalidation:
      via copy: replicatePutsViaCopy=true

      ...LocalQCacheOp{cacheName='com.atlassian.jira.cluster.dbr.DBRMessage', action=PUT, key=DBR, value == null ? false, replicatePutsViaCopy=true,...
      

      via invalidation: replicatePutsViaCopy=false
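
      A hedged shell sketch for this step, counting how many of the queue-overflow errors were raised for each queue type (it assumes the standard atlassian-jira.log is in the current directory):

      grep 'Critical state of local cache replication queue' atlassian-jira.log | grep -o 'replicatePutsViaCopy=[a-z]*' | sort | uniq -c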

      2. Now search the log for the last entry with total localq stats:
      via copy

      cat atlassian-jira.log | grep JIRA-STATS | grep LOCALQ | grep VIA-COPY

      via invalidation

      cat atlassian-jira.log | grep JIRA-STATS | grep LOCALQ | grep VIA-INVALIDATION
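
      The stats are logged periodically, so only the most recent entry matters; the same command with tail keeps just the last line (use VIA-COPY or VIA-INVALIDATION as identified in step 1):

      cat atlassian-jira.log | grep JIRA-STATS | grep LOCALQ | grep VIA-INVALIDATION | tail -1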

      3. Get the JSON part of the last log line:

      [LOCALQ] [VIA-INVALIDATION] Cache replication queue stats per node: node2 total stats: 
       {
        "timestampMillis": 1610720566437,
        "nodeId": "node2",
        "queueSize": 0,
        "startQueueSize": 0,
        ...
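
      If you prefer to inspect the stats with a JSON tool, a hedged sketch that takes the last stats line, strips the log prefix up to the first opening brace, and saves the payload to a hypothetical stats.json (this assumes the whole JSON document is printed on a single log line; if it is pretty-printed across several lines, copy it out manually instead):

      cat atlassian-jira.log | grep JIRA-STATS | grep LOCALQ | grep VIA-INVALIDATION | tail -1 | sed 's/^[^{]*//' > stats.json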

      4. Look at the last section in the JSON log above:

      "addCounterByCacheTopN": {
          "UserPropertyManager.ApplicationUser.cache": 543,
          "com.atlassian.jira.crowd.embedded.ofbiz.LazyOfBizUserCache.userCache": 291,
          "com.atlassian.servicedesk.internal.feature.queue.ProjectStateCacheManagerImpl.sdProjectStateCache": 
          "com.atlassian.gadgets.renderer.internal.cache.messageBundles": 153,
          "com.atlassian.gadgets.renderer.internal.cache.httpResponses": 94,
          "DbBackedCachedPropertySetManager.jira.properties.cache": 66,
          "com.atlassian.jira.crowd.embedded.ofbiz.OfBizInternalMembershipDao.childrenCache": 37,
          "com.atlassian.jira.crowd.embedded.ofbiz.OfBizInternalMembershipDao.parentsCache": 37,
          "com.atlassian.jira.crowd.embedded.ofbiz.OfBizDirectoryDao.directoryCache": 10,
          "com.atlassian.jira.workflow.CachingDraftWorkflowStore.draftWorkflowCache": 8
        }
      

      This entry shows the top N caches contributing to the replication queue: [cache name]: [number of added entries]
      This may give you a hint about what is causing the problem.
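
      With the stats saved to the hypothetical stats.json from the sketch above, and assuming addCounterByCacheTopN is a top-level field as in the excerpt, a hedged jq one-liner lists the noisiest caches first:

      jq -r '.addCounterByCacheTopN | to_entries | sort_by(-.value) | .[] | "\(.value)\t\(.key)"' stats.json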
