[JRASERVER-70443] NodeReindexServiceThread can stop checking messages

Type: Bug
Resolution: Fixed
Priority: High
Fix Version/s: 9.1.0
Affects Version/s: 7.6.9, 8.4.2, 8.5.1, 8.5.4
Component/s: Data Center - Other
Labels: delta_reindex, fixedByDelta, pse-request

Introduced in Version: 7.06
Support reference count: 31
Symptom Severity: Severity 2 - Major
UIS: 47

Problem:

NodeReindexServiceThread can enter a state in which it no longer checks for messages. This can cause inconsistency between Jira Data Center nodes.

Customer situation:

In a standby environment, during cutover, the snapshot was recovered, but NodeReindexServiceThread was not checking messages, so the nodes fell behind.

Environment

JIRA Data Center

Expected Results

NodeReindexServiceThread:thread-1 is in the TIMED_WAITING state, parked in parkNanos and sleeping until it wakes up to check messages again:

node2

"NodeReindexServiceThread:thread-1" #128 prio=5 os_prio=0 tid=0x00007f8feb238800 nid=0x8379 waiting on condition [0x00007f8fd1bf9000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000006869304c0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
	at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
	at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
	at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

What was observed instead:

Thread dumps showed that NodeReindexServiceThread:thread-1 was in the WAITING state, parked without a timeout:

node1

"NodeReindexServiceThread:thread-1" #110 prio=5 os_prio=0 tid=0x00007ff06f215800 nid=0x18f46 waiting on condition [0x00007ff04b429000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x0000000686202370> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
	at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1081)
	at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
	at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
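
The difference between the two dumps can be checked mechanically. Below is a minimal sketch, not part of Jira: `ThreadDumpState` and its methods are hypothetical names, and it assumes HotSpot-style dump text in which the `java.lang.Thread.State:` line directly follows the thread header. It also assumes the executor has a single worker, as the thread name suggests; in that case WAITING inside DelayedWorkQueue.take means the queue is empty and nothing is scheduled to run again.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not a Jira API): reads the state of a named thread from thread dump text. */
final class ThreadDumpState {

    // Matches the "java.lang.Thread.State: XXX" line that follows a thread header in a HotSpot dump.
    private static final Pattern STATE = Pattern.compile("java\\.lang\\.Thread\\.State: (\\w+)");

    /** Returns the state name (e.g. WAITING, TIMED_WAITING) of the first thread with the given name. */
    static Optional<String> stateOf(String dump, String threadName) {
        String header = "\"" + threadName + "\"";
        String[] lines = dump.split("\\R");
        for (int i = 0; i < lines.length - 1; i++) {
            if (lines[i].startsWith(header)) {
                Matcher m = STATE.matcher(lines[i + 1]);
                if (m.find()) {
                    return Optional.of(m.group(1));
                }
            }
        }
        return Optional.empty();
    }

    /** True when the thread is parked with no timeout, which is the state observed on node1. */
    static boolean looksStalled(String dump, String threadName) {
        return stateOf(dump, threadName).map("WAITING"::equals).orElse(false);
    }

    public static void main(String[] args) {
        String dump = String.join("\n",
                "\"NodeReindexServiceThread:thread-1\" #110 prio=5 os_prio=0 waiting on condition",
                "   java.lang.Thread.State: WAITING (parking)",
                "\tat sun.misc.Unsafe.park(Native Method)");
        System.out.println(looksStalled(dump, "NodeReindexServiceThread:thread-1"));
    }
}
```

A single WAITING sample is only a hint; taking several dumps a few seconds apart and seeing the same state each time is stronger evidence that the thread is not being rescheduled.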

Perceived results:

The health check detects that the index data is behind the database and reports the delay, which keeps increasing because NodeReindexServiceThread:thread-1 is not retrieving the messages.

Notes

As part of the Lucene changes in Jira 8.x, the code in this area was improved, which makes the problem less likely to occur. For example, com.atlassian.jira.index.ha.DefaultNodeReindexService#reIndex now catches Throwable.
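
Why catching Throwable matters can be illustrated with plain java.util.concurrent. The stack traces above show a ScheduledThreadPoolExecutor worker, and the documented behaviour of scheduleAtFixedRate and scheduleWithFixedDelay is that subsequent executions are suppressed once one execution throws. The sketch below is not Jira code: the class name, the task body, and the timings are made up for illustration, and this ticket does not confirm that an uncaught exception was the trigger in the customer case.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative only: shows how one uncaught exception ends a periodic task. */
final class PeriodicTaskSuppressionDemo {

    /** Schedules a task that fails on its second run and returns how many times it ran. */
    static int runs(boolean catchThrowable) {
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger count = new AtomicInteger();

        // Stand-in for "check messages"; the second invocation simulates a failure.
        Runnable body = () -> {
            if (count.incrementAndGet() == 2) {
                throw new IllegalStateException("simulated failure");
            }
        };
        // Same task, but nothing escapes to the executor, so the schedule survives.
        Runnable guarded = () -> {
            try {
                body.run();
            } catch (Throwable t) {
                // A real service would log the failure here.
            }
        };

        pool.scheduleWithFixedDelay(catchThrowable ? guarded : body, 0, 10, TimeUnit.MILLISECONDS);
        try {
            Thread.sleep(500);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        pool.shutdownNow();
        return count.get();
    }

    public static void main(String[] args) {
        System.out.println("unguarded runs: " + runs(false));
        System.out.println("guarded runs:   " + runs(true));
    }
}
```

In the unguarded case the task stops after the failing run and the worker then parks in an untimed wait on the empty queue, which matches the WAITING (parking) trace from node1. In the guarded case the task keeps being rescheduled.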

Workaround:

1. Stop Jira on all nodes at the same time.
2. Start the first node. Wait until the snapshot is restored, and then restart the node.
3. After the restart, and after NodeReindexServiceThread:thread-1 updates the index, start the second node.
4. If there are more nodes, start them sequentially.

is related to

JRASERVER-72099 Index snapshot restore fails and Jira does not start in Disaster Recovery mode

Closed

JRASERVER-66557 ClusterMessageHandlerServiceThread can stop checking messages if Throwable is encountered

Gathering Impact

relates to

JRASERVER-72125 Index replication service is paused indefinitely after failing to obtain an index snapshot from another node

Closed

links to

Index falling behind on one node but Index recovery isn't working - all after a DR event


Maciej Swinarski (Inactive) made changes - 28/Feb/2023 3:03 PM

Assignee

New: Maciej Swinarski [ mswinarski ]

Maciej Swinarski (Inactive) made changes - 28/Feb/2023 3:01 PM

Resolution		New: Fixed [ 1 ]
Status	Original: Gathering Impact [ 12072 ]	New: Closed [ 6 ]

Maciej Swinarski (Inactive) made changes - 28/Feb/2023 2:59 PM

Labels

Original: delta_reindex pse-request

New: delta_reindex fixedByDelta pse-request

Maciej Swinarski (Inactive) made changes - 28/Feb/2023 2:59 PM

Fix Version/s

New: 9.1.0 [ 99995 ]

Rodrigo Martinez made changes - 28/Feb/2023 1:54 PM

Link

New: This issue relates to JRASERVER-72125 [ JRASERVER-72125 ]

SET Analytics Bot made changes - 24/Feb/2023 2:07 AM

UIS

Original: 46

New: 47

SET Analytics Bot made changes - 23/Feb/2023 2:07 AM

UIS

Original: 47

New: 46

SET Analytics Bot made changes - 19/Feb/2023 2:06 AM

UIS

Original: 23

New: 47

SET Analytics Bot made changes - 03/Feb/2023 2:10 AM

UIS

Original: 24

New: 23

SET Analytics Bot made changes - 20/Jan/2023 2:10 AM

UIS

Original: 23

New: 24

Assignee: Maciej Swinarski (Inactive)

Reporter: Murakami

Affected customers: 9

Watchers: 24

Created: 03/Jan/2020 12:26 AM

Updated: 28/Feb/2023 3:03 PM

Resolved: 28/Feb/2023 3:01 PM
