[JRASERVER-70443] NodeReindexServiceThread can stop checking messages

Type: Bug
Resolution: Fixed
Priority: High
Fix Version/s: 9.1.0
Affects Version/s: 7.6.9, 8.4.2, 8.5.1, 8.5.4
Component/s: Data Center - Other
Labels:

Introduced in Version: 7.06
Support reference count: 31
Symptom Severity: Severity 2 - Major
UIS: 47

Problem:

NodeReindexServiceThread can enter a state in which it no longer checks for messages. This can cause inconsistency between Jira Data Center nodes.

Customer situation:

In a standby environment, during the cutover, the snapshot was recovered but NodeReindexServiceThread was not checking messages, so the nodes fell behind.

Environment

JIRA Data Center

Expected Results

NodeReindexServiceThread:thread-1 is in the TIMED_WAITING state, parked in parkNanos, sleeping until it wakes up to check messages again:

node2

"NodeReindexServiceThread:thread-1" #128 prio=5 os_prio=0 tid=0x00007f8feb238800 nid=0x8379 waiting on condition [0x00007f8fd1bf9000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000006869304c0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
	at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
	at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
	at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

What was observed instead:

Thread dumps showed that NodeReindexServiceThread:thread-1 was in the WAITING state, parked in park with no timeout:

node1

"NodeReindexServiceThread:thread-1" #110 prio=5 os_prio=0 tid=0x00007ff06f215800 nid=0x18f46 waiting on condition [0x00007ff04b429000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x0000000686202370> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
	at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1081)
	at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
	at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
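
The two dumps above differ only in the thread state line and in the park call, so a dump can be checked mechanically. The following is a minimal sketch, not part of Jira: the class name, method name, and returned labels are hypothetical, and it assumes the dump uses the HotSpot text layout shown above. A single WAITING sample is only a hint, so it is worth comparing several dumps taken some seconds apart.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReindexThreadState {
    // Thread header line, then the "java.lang.Thread.State: X" line that follows it.
    private static final Pattern STATE = Pattern.compile(
            "\"NodeReindexServiceThread:thread-1\"[^\\n]*\\n\\s*java\\.lang\\.Thread\\.State: (\\w+)");

    /** Returns a hypothetical label describing the reindex thread found in a dump. */
    static String classify(String threadDump) {
        Matcher m = STATE.matcher(threadDump);
        if (!m.find()) {
            return "NOT_FOUND";
        }
        switch (m.group(1)) {
            case "TIMED_WAITING":
                return "IDLE_UNTIL_NEXT_CHECK"; // parked with a timeout, as on node2
            case "WAITING":
                return "POSSIBLY_STALLED";      // parked with no timeout, as on node1
            default:
                return "OTHER_" + m.group(1);   // e.g. RUNNABLE while processing messages
        }
    }

    public static void main(String[] args) {
        String dump = "\"NodeReindexServiceThread:thread-1\" #110 prio=5 waiting on condition\n"
                + "   java.lang.Thread.State: WAITING (parking)\n";
        System.out.println(classify(dump));
    }
}
```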

Perceived results:

The health check detects that the index data is behind the database and reports the delay, which keeps growing because NodeReindexServiceThread:thread-1 is not receiving the messages.

Notes

As part of the Lucene changes in Jira 8.x, the code in this area was improved, which makes the problem less likely to occur; e.g. com.atlassian.jira.index.ha.DefaultNodeReindexService#reIndex now catches Throwable.
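
Why catching Throwable matters can be illustrated with the JDK alone. ScheduledExecutorService documents that if any execution of a periodic task throws, subsequent executions are suppressed; the worker thread then waits on an empty queue, which is consistent with the WAITING (parking) dump from node1. The sketch below is an assumption-laden illustration, not Jira's code: the class, the simulated failure, and the use of scheduleWithFixedDelay are all hypothetical stand-ins.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class PeriodicTaskSuppression {

    /** Schedules a task that throws on its second run and returns how many runs happened. */
    static int runCount(boolean guard) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger runs = new AtomicInteger();

        // Stand-in for "check messages"; fails once, on the second execution.
        Runnable failing = () -> {
            if (runs.incrementAndGet() == 2) {
                throw new IllegalStateException("simulated failure");
            }
        };
        // Same task, but nothing is allowed to escape to the executor.
        Runnable guarded = () -> {
            try {
                failing.run();
            } catch (Throwable t) {
                // A real service would log here; swallowing keeps the schedule alive.
            }
        };

        executor.scheduleWithFixedDelay(guard ? guarded : failing, 0, 20, TimeUnit.MILLISECONDS);
        try {
            Thread.sleep(500); // give the schedule time to run many times
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        executor.shutdownNow();
        return runs.get();
    }

    public static void main(String[] args) {
        System.out.println("unguarded runs: " + runCount(false)); // stops after the failing run
        System.out.println("guarded runs:   " + runCount(true));  // keeps running
    }
}
```

In the unguarded case the task never runs again after the exception, and nothing is logged unless the returned future is inspected, which is why the stall is easy to miss.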

Workaround:

1. Stop Jira on all nodes at the same time.
2. Start the first node. Wait until the snapshot is restored, then restart the node.
3. After the restart, and once NodeReindexServiceThread:thread-1 has updated the index, start the second node.
4. If there are more nodes, start them one at a time.

is related to

JRASERVER-72099 Index snapshot restore fails and Jira does not start in Disaster Recovery mode (Closed)
JRASERVER-66557 ClusterMessageHandlerServiceThread can stop checking messages if Throwable is encountered (Gathering Impact)

relates to

JRASERVER-72125 Index replication service is paused indefinitely after failing to obtain an index snapshot from another node (Closed)

links to

Index falling behind on one node but Index recovery isn't working - all after a DR event


Jutamat (Kate) Phothisitthisak added a comment - 24/Oct/2020 1:05 AM

We are running Jira Data Center v. 8.4.3 and this issue now happens very frequently: users get different data when they load the same object (e.g. board, filter, sprint) at the same time. Please provide a fix or patch as soon as possible. This issue was submitted as ticket GHS-203523.

