Jira Data Center / JRASERVER-72856

Large Localq file sizes can result in Cache Replication errors from Tape


Details

    • 7.13
    • Severity 2 - Major
      We have strong evidence that JRASERVER-73252 might have been the root cause of this issue. Note that JRASERVER-73252 was triggered not only by a DB restart, but could also result from any temporary networking problems between the Jira nodes and the DB.


    Description

      Issue Summary

      At times the cache replication queue files can become quite large, resulting in cache replication errors. This has happened across a number of support tickets and the root cause is still not known; a common theme is the large size of the localq files.

       

      Steps to Reproduce

      n/a

      Expected Results

      Cache replication continues to work without errors

      Actual Results

      Cache replication fails with errors like the following in the logs:

      2021-09-20 19:03:01,157 http-nio-8080-exec-84 ERROR Jira 1303x259x4 11g2toz 10.12.200.110,10.12.200.110,127.0.0.1 /rest/api/2/issue/TEST-13214/remotelink [c.a.j.c.distribution.localq.LocalQCacheManager] [LOCALQ] Critical state of local cache replication queue - cannot add: LocalQCacheOp{cacheName='com.atlassian.jira.cluster.dbr.DBRMessage', action=PUT, key=DBR, value == null ? false, replicatePutsViaCopy=true, creationTimeInMillis=1632164581156} to queue: [queueId=queue_prodnode2_1_9c79a1ab55dcc55cc4a830cfc917b8a9_put, queuePath=/var/atlassian/jira/application-data/localq/queue_prodnode2_1_9c79a1ab55dcc55cc4a830cfc917b8a9_put], error: Failed to add entry.
      com.squareup.tape.FileException: Failed to add entry.
      	at com.squareup.tape.FileObjectQueue.add(FileObjectQueue.java:61)
      	at com.atlassian.jira.cluster.distribution.localq.tape.TapeLocalQCacheOpQueue.add(TapeLocalQCacheOpQueue.java:167)
      	at com.atlassian.jira.cluster.distribution.localq.LocalQCacheOpQueueWithStats.add(LocalQCacheOpQueueWithStats.java:121)
      	at com.atlassian.jira.cluster.distribution.localq.LocalQCacheManager.addToQueue(LocalQCacheManager.java:401)
      	at com.atlassian.jira.cluster.distribution.localq.LocalQCacheManager.addToAllQueues(LocalQCacheManager.java:382)
      	at com.atlassian.jira.cluster.distribution.localq.LocalQCacheReplicator.replicateToQueue(LocalQCacheReplicator.java:87)
      	at com.atlassian.jira.cluster.distribution.localq.LocalQCacheReplicator.replicatePutNotification(LocalQCacheReplicator.java:67)
      	at com.atlassian.jira.cluster.cache.ehcache.AbstractJiraCacheReplicator.notifyElementUpdated(AbstractJiraCacheReplicator.java:123)
      	at net.sf.ehcache.event.RegisteredEventListeners.internalNotifyElementUpdated(RegisteredEventListeners.java:228)
      	at net.sf.ehcache.event.RegisteredEventListeners.notifyElementUpdated(RegisteredEventListeners.java:206)
      	at net.sf.ehcache.Cache.notifyPutInternalListeners(Cache.java:1646)
      	at net.sf.ehcache.Cache.putInternal(Cache.java:1618)
      	at net.sf.ehcache.Cache.put(Cache.java:1543)
      	at net.sf.ehcache.Cache.put(Cache.java:1508)
      	at com.atlassian.cache.ehcache.DelegatingCache.put(DelegatingCache.java:93)
      	at com.atlassian.jira.cache.DeferredReplicationCache.lambda$put$0(DeferredReplicationCache.java:60)
      	at com.atlassian.jira.cluster.cache.ehcache.BlockingParallelCacheReplicator.runDeferred(BlockingParallelCacheReplicator.java:172)
      	at com.atlassian.jira.cache.DeferredReplicationCache.put(DeferredReplicationCache.java:59)
      	at com.atlassian.jira.cluster.dbr.transport.DefaultDBRTransportManager.send(DefaultDBRTransportManager.java:45)
      	at com.atlassian.jira.cluster.dbr.DefaultDBRSender.sendDBRMessage(DefaultDBRSender.java:135)
      	at com.atlassian.jira.cluster.dbr.DefaultDBRSender.sendUpdateWithRelated(DefaultDBRSender.java:90)
      	at com.atlassian.jira.cluster.dbr.AsyncDBRSender.lambda$sendUpdateWithRelated$0(AsyncDBRSender.java:68)
      	at co.elastic.apm.agent.concurrent.JavaConcurrent$RunnableLambdaWrapper.run(JavaConcurrent.java:220)
      	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      	at java.base/java.lang.Thread.run(Thread.java:834)
      Caused by: java.io.IOException: Invalid argument
      	at java.base/java.io.RandomAccessFile.setLength(Native Method)
      	at com.squareup.tape.QueueFile.setLength(QueueFile.java:406)
      	at com.squareup.tape.QueueFile.expandIfNecessary(QueueFile.java:375)
      	at com.squareup.tape.QueueFile.add(QueueFile.java:305)
      	at com.squareup.tape.FileObjectQueue.add(FileObjectQueue.java:58)
      

      Workaround

      In several reports, deleting the contents of the localq directory on each node and restarting Jira seems to resolve the errors (a minimal shell sketch follows the example file names below).

      The localq files are found in the following path:

      • <local-home-directory>/localq

      Each file will be named similarly to the ones below:

      queue_Node2_0_f5f366263dcc357e2720042f33286f8f
      queue_Node2_9_f5f366263dcc357e2720042f33286f8f
      queue_Node3_0_bbb747b1516b5225f1ec1c65887b39fc
      queue_Node3_9_bbb747b1516b5225f1ec1c65887b39fc  
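      A minimal shell sketch of this workaround, run on each node in turn. The local home path and the service commands below are assumptions and must be adapted to your installation; Jira has to be stopped on the node before the queue files are removed.

      JIRA_LOCAL_HOME=/var/atlassian/application-data/jira   # placeholder - use this node's local home directory
      sudo systemctl stop jira                                # or your installation's stop script
      rm -f "$JIRA_LOCAL_HOME"/localq/*                       # delete the replication queue files
      sudo systemctl start jira                               # start the node again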

      Investigation

      In general, the localq queue files should be empty most of the time. When they are overflowing, there is already another problem in the system. Use the localq stats to investigate which process is the main source of cache replication events.

      1. In the error message, find which type of queue is overflowing: via copy or via invalidation:
      via copy: replicatePutsViaCopy=true

      ...LocalQCacheOp{cacheName='com.atlassian.jira.cluster.dbr.DBRMessage', action=PUT, key=DBR, value == null ? false, replicatePutsViaCopy=true,...
      

      via invalidation: replicatePutsViaCopy=false
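
      A hedged shell sketch for this step, counting how many of the queue-overflow errors were raised for each queue type (it assumes the standard atlassian-jira.log is in the current directory):

      grep 'Critical state of local cache replication queue' atlassian-jira.log | grep -o 'replicatePutsViaCopy=[a-z]*' | sort | uniq -c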

      2. Now search the log for the last entry with total localq stats:
      via copy

      cat atlassian-jira.log | grep JIRA-STATS | grep LOCALQ | grep VIA-COPY

      via invalidation

      cat atlassian-jira.log | grep JIRA-STATS | grep LOCALQ | grep VIA-INVALIDATION
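
      The stats are logged periodically, so only the most recent entry matters; the same command with tail keeps just the last line (use VIA-COPY or VIA-INVALIDATION as identified in step 1):

      cat atlassian-jira.log | grep JIRA-STATS | grep LOCALQ | grep VIA-INVALIDATION | tail -1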

      3. Get the JSON part of the last log line:

      [LOCALQ] [VIA-INVALIDATION] Cache replication queue stats per node: node2 total stats: 
       {
        "timestampMillis": 1610720566437,
        "nodeId": "node2",
        "queueSize": 0,
        "startQueueSize": 0,
        ...
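
      If you prefer to inspect the stats with a JSON tool, a hedged sketch that takes the last stats line, strips the log prefix up to the first opening brace, and saves the payload to a hypothetical stats.json (this assumes the whole JSON document is printed on a single log line; if it is pretty-printed across several lines, copy it out manually instead):

      cat atlassian-jira.log | grep JIRA-STATS | grep LOCALQ | grep VIA-INVALIDATION | tail -1 | sed 's/^[^{]*//' > stats.json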

      4. Look at the last section in the JSON log above:

      "addCounterByCacheTopN": {
          "UserPropertyManager.ApplicationUser.cache": 543,
          "com.atlassian.jira.crowd.embedded.ofbiz.LazyOfBizUserCache.userCache": 291,
          "com.atlassian.servicedesk.internal.feature.queue.ProjectStateCacheManagerImpl.sdProjectStateCache": 
          "com.atlassian.gadgets.renderer.internal.cache.messageBundles": 153,
          "com.atlassian.gadgets.renderer.internal.cache.httpResponses": 94,
          "DbBackedCachedPropertySetManager.jira.properties.cache": 66,
          "com.atlassian.jira.crowd.embedded.ofbiz.OfBizInternalMembershipDao.childrenCache": 37,
          "com.atlassian.jira.crowd.embedded.ofbiz.OfBizInternalMembershipDao.parentsCache": 37,
          "com.atlassian.jira.crowd.embedded.ofbiz.OfBizDirectoryDao.directoryCache": 10,
          "com.atlassian.jira.workflow.CachingDraftWorkflowStore.draftWorkflowCache": 8
        }
      

      This entry shows the top N caches contributing to the replication queue: [cache name]: [number of added entries]
      This may give you a hint about what is causing the problem.
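
      With the stats saved to the hypothetical stats.json from the sketch above, and assuming addCounterByCacheTopN is a top-level field as in the excerpt, a hedged jq one-liner lists the noisiest caches first:

      jq -r '.addCounterByCacheTopN | to_entries | sort_by(-.value) | .[] | "\(.value)\t\(.key)"' stats.json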
