Details
- Bug
- Resolution: Fixed
- Medium
- 7.13.18, 8.13.8, 8.17.1, 8.5.19, 8.20.7
- 7.13
- 35
- Severity 2 - Major
- 146
Description
Issue Summary
At times the replication queue files can grow quite large, resulting in cache replication errors. This has been observed across a number of support tickets, and the root cause is still unknown. A common theme is the size of the localq files.
Steps to Reproduce
n/a
Expected Results
Cache replication continues to work without errors
Actual Results
Cache replication fails with these errors in logs:
2021-09-20 19:03:01,157 http-nio-8080-exec-84 ERROR Jira 1303x259x4 11g2toz 10.12.200.110,10.12.200.110,127.0.0.1 /rest/api/2/issue/TEST-13214/remotelink [c.a.j.c.distribution.localq.LocalQCacheManager] [LOCALQ] Critical state of local cache replication queue - cannot add: LocalQCacheOp{cacheName='com.atlassian.jira.cluster.dbr.DBRMessage', action=PUT, key=DBR, value == null ? false, replicatePutsViaCopy=true, creationTimeInMillis=1632164581156} to queue: [queueId=queue_prodnode2_1_9c79a1ab55dcc55cc4a830cfc917b8a9_put, queuePath=/var/atlassian/jira/application-data/localq/queue_prodnode2_1_9c79a1ab55dcc55cc4a830cfc917b8a9_put], error: Failed to add entry.
com.squareup.tape.FileException: Failed to add entry.
    at com.squareup.tape.FileObjectQueue.add(FileObjectQueue.java:61)
    at com.atlassian.jira.cluster.distribution.localq.tape.TapeLocalQCacheOpQueue.add(TapeLocalQCacheOpQueue.java:167)
    at com.atlassian.jira.cluster.distribution.localq.LocalQCacheOpQueueWithStats.add(LocalQCacheOpQueueWithStats.java:121)
    at com.atlassian.jira.cluster.distribution.localq.LocalQCacheManager.addToQueue(LocalQCacheManager.java:401)
    at com.atlassian.jira.cluster.distribution.localq.LocalQCacheManager.addToAllQueues(LocalQCacheManager.java:382)
    at com.atlassian.jira.cluster.distribution.localq.LocalQCacheReplicator.replicateToQueue(LocalQCacheReplicator.java:87)
    at com.atlassian.jira.cluster.distribution.localq.LocalQCacheReplicator.replicatePutNotification(LocalQCacheReplicator.java:67)
    at com.atlassian.jira.cluster.cache.ehcache.AbstractJiraCacheReplicator.notifyElementUpdated(AbstractJiraCacheReplicator.java:123)
    at net.sf.ehcache.event.RegisteredEventListeners.internalNotifyElementUpdated(RegisteredEventListeners.java:228)
    at net.sf.ehcache.event.RegisteredEventListeners.notifyElementUpdated(RegisteredEventListeners.java:206)
    at net.sf.ehcache.Cache.notifyPutInternalListeners(Cache.java:1646)
    at net.sf.ehcache.Cache.putInternal(Cache.java:1618)
    at net.sf.ehcache.Cache.put(Cache.java:1543)
    at net.sf.ehcache.Cache.put(Cache.java:1508)
    at com.atlassian.cache.ehcache.DelegatingCache.put(DelegatingCache.java:93)
    at com.atlassian.jira.cache.DeferredReplicationCache.lambda$put$0(DeferredReplicationCache.java:60)
    at com.atlassian.jira.cluster.cache.ehcache.BlockingParallelCacheReplicator.runDeferred(BlockingParallelCacheReplicator.java:172)
    at com.atlassian.jira.cache.DeferredReplicationCache.put(DeferredReplicationCache.java:59)
    at com.atlassian.jira.cluster.dbr.transport.DefaultDBRTransportManager.send(DefaultDBRTransportManager.java:45)
    at com.atlassian.jira.cluster.dbr.DefaultDBRSender.sendDBRMessage(DefaultDBRSender.java:135)
    at com.atlassian.jira.cluster.dbr.DefaultDBRSender.sendUpdateWithRelated(DefaultDBRSender.java:90)
    at com.atlassian.jira.cluster.dbr.AsyncDBRSender.lambda$sendUpdateWithRelated$0(AsyncDBRSender.java:68)
    at co.elastic.apm.agent.concurrent.JavaConcurrent$RunnableLambdaWrapper.run(JavaConcurrent.java:220)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.io.IOException: Invalid argument
    at java.base/java.io.RandomAccessFile.setLength(Native Method)
    at com.squareup.tape.QueueFile.setLength(QueueFile.java:406)
    at com.squareup.tape.QueueFile.expandIfNecessary(QueueFile.java:375)
    at com.squareup.tape.QueueFile.add(QueueFile.java:305)
    at com.squareup.tape.FileObjectQueue.add(FileObjectQueue.java:58)
Workaround
In several reports, deleting the contents of the localq directory on each node and then restarting Jira resolved the errors.
The localq files are found in the following path:
- <local-home-directory>/localq
Each file will be named similarly to the ones below:
- queue_Node2_0_f5f366263dcc357e2720042f33286f8f
- queue_Node2_9_f5f366263dcc357e2720042f33286f8f
- queue_Node3_0_bbb747b1516b5225f1ec1c65887b39fc
- queue_Node3_9_bbb747b1516b5225f1ec1c65887b39fc
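A minimal sketch of the workaround for one node, assuming a typical local-home layout. The paths and file names below are illustrative (the script uses a temp directory with fake queue files so it is safe to run as-is); on a real node you would first stop Jira, then remove the queue files from the actual localq directory.

```shell
#!/bin/sh
# Illustrative sketch only: clear the localq queue files on one node.
# JIRA_LOCAL_HOME is a stand-in for the node's real local home directory;
# here it points at a temp dir populated with fake queue files.
JIRA_LOCAL_HOME="$(mktemp -d)"
LOCALQ_DIR="$JIRA_LOCAL_HOME/localq"

# Simulate existing queue files (real names look like queue_<node>_<n>_<hash>)
mkdir -p "$LOCALQ_DIR"
touch "$LOCALQ_DIR/queue_Node2_0_f5f366263dcc357e2720042f33286f8f" \
      "$LOCALQ_DIR/queue_Node2_9_f5f366263dcc357e2720042f33286f8f"

# 1. Stop Jira on this node first (not shown here).
# 2. Remove the queue files; keep the localq directory itself.
rm -f "$LOCALQ_DIR"/queue_*

# 3. Restart Jira; it recreates empty queue files as needed.
ls -A "$LOCALQ_DIR"    # prints nothing: the directory is now empty
```

Repeat on every node in the cluster before bringing the nodes back up.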
Investigation
In general, the localq queue files should be empty most of the time. When they are overflowing, there is already another problem in the system. Use the localq stats to investigate which process is the main source of cache replication events.
1. In the error message find which type of queues are overflowing: via copy or via invalidation:
via copy: replicatePutsViaCopy=true
...LocalQCacheOp{cacheName='com.atlassian.jira.cluster.dbr.DBRMessage', action=PUT, key=DBR, value == null ? false, replicatePutsViaCopy=true,...
via invalidation: replicatePutsViaCopy=false
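To see at a glance which flavor dominates in your own logs, a rough grep over the error lines can count occurrences of each value. The log written below is a synthetic sample for illustration; point LOG at your node's atlassian-jira.log instead.

```shell
#!/bin/sh
# Count critical-queue errors by queue type (via copy vs via invalidation).
# The log file here is a synthetic sample, not real Jira output.
LOG="$(mktemp)"
printf '%s\n' \
  'ERROR ... LocalQCacheOp{..., replicatePutsViaCopy=true, ...}'  \
  'ERROR ... LocalQCacheOp{..., replicatePutsViaCopy=false, ...}' \
  'ERROR ... LocalQCacheOp{..., replicatePutsViaCopy=true, ...}'  > "$LOG"

# Tally each occurrence of the flag
grep -oE 'replicatePutsViaCopy=(true|false)' "$LOG" | sort | uniq -c
```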
2. Now search the log for the most recent total localq stats entry:
via copy
grep JIRA-STATS atlassian-jira.log | grep LOCALQ | grep VIA-COPY
via invalidation
grep JIRA-STATS atlassian-jira.log | grep LOCALQ | grep VIA-INVALIDATION
3. Get the JSON part of the last matching log line:
[LOCALQ] [VIA-INVALIDATION] Cache replication queue stats per node: node2 total stats: { "timestampMillis": 1610720566437, "nodeId": "node2", "queueSize": 0, "startQueueSize": 0, ...
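One way to isolate the JSON payload, assuming the log layout shown above, is to take the last matching line and strip everything before the first `{`. The log content below is a trimmed synthetic sample; substitute your own atlassian-jira.log.

```shell
#!/bin/sh
# Extract the JSON payload from the most recent total-stats line.
# The log content is a shortened synthetic sample, not real Jira output.
LOG="$(mktemp)"
echo 'JIRA-STATS [LOCALQ] [VIA-INVALIDATION] Cache replication queue stats per node: node2 total stats: {"nodeId": "node2", "queueSize": 0}' > "$LOG"

# Last matching line, with everything before the first '{' removed
JSON="$(grep 'VIA-INVALIDATION' "$LOG" | tail -n 1 | sed 's/^[^{]*//')"
echo "$JSON"
```

The resulting string is a bare JSON object, ready to paste into a JSON pretty-printer or pipe into a tool such as jq.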
4. Look at the last section of the JSON payload above:
"addCounterByCacheTopN": {
    "UserPropertyManager.ApplicationUser.cache": 543,
    "com.atlassian.jira.crowd.embedded.ofbiz.LazyOfBizUserCache.userCache": 291,
    "com.atlassian.servicedesk.internal.feature.queue.ProjectStateCacheManagerImpl.sdProjectStateCache":
    "com.atlassian.gadgets.renderer.internal.cache.messageBundles": 153,
    "com.atlassian.gadgets.renderer.internal.cache.httpResponses": 94,
    "DbBackedCachedPropertySetManager.jira.properties.cache": 66,
    "com.atlassian.jira.crowd.embedded.ofbiz.OfBizInternalMembershipDao.childrenCache": 37,
    "com.atlassian.jira.crowd.embedded.ofbiz.OfBizInternalMembershipDao.parentsCache": 37,
    "com.atlassian.jira.crowd.embedded.ofbiz.OfBizDirectoryDao.directoryCache": 10,
    "com.atlassian.jira.workflow.CachingDraftWorkflowStore.draftWorkflowCache": 8
}
This entry shows the top N caches contributing to the replication queue, in the form [cache name]: [number of added entries].
This may give you some hint on what is causing the problem.
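If the topN map is long, a rough shell sketch can rank the entries by count without extra tooling. The payload below is a shortened sample modeled on the excerpt above; feed it the JSON pulled from your own stats line.

```shell
#!/bin/sh
# Rank addCounterByCacheTopN entries by added-entry count, highest first.
# The JSON here is a shortened illustrative sample.
JSON='{"addCounterByCacheTopN": {"com.atlassian.jira.crowd.embedded.ofbiz.LazyOfBizUserCache.userCache": 291, "UserPropertyManager.ApplicationUser.cache": 543, "com.atlassian.gadgets.renderer.internal.cache.messageBundles": 153}}'

# Pull out each "cache name": count pair, then sort numerically on the count
echo "$JSON" | grep -oE '"[^"]+": [0-9]+' | sort -t: -k2 -rn
```

The cache at the top of the output is the heaviest contributor to the replication queue and the first place to look for the underlying problem.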
Issue Links
- relates to
  - JRASERVER-73252 Restarting the database causes cache replication issues (Closed)
  - DELTA-1131