Problem:
NodeReindexServiceThread can enter a state in which it no longer checks messages. This can cause inconsistency between Jira Data Center nodes.
Customer situation:
In a standby environment, during the cutover, the snapshot was recovered but NodeReindexServiceThread was not checking messages, so the nodes fell behind.
Environment
Expected Results
NodeReindexServiceThread:thread-1 is in the TIMED_WAITING state, parked in parkNanos, sleeping until it wakes up to check messages again:
"NodeReindexServiceThread:thread-1" #128 prio=5 os_prio=0 tid=0x00007f8feb238800 nid=0x8379 waiting on condition [0x00007f8fd1bf9000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000006869304c0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
What was observed instead:
- Thread dumps showed NodeReindexServiceThread:thread-1 in the WAITING state, parked in park with no timeout:
"NodeReindexServiceThread:thread-1" #110 prio=5 os_prio=0 tid=0x00007ff06f215800 nid=0x18f46 waiting on condition [0x00007ff04b429000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0000000686202370> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1081)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
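The two dumps differ in the Thread.State line: TIMED_WAITING means the worker is waiting for its next scheduled run, while WAITING inside DelayedWorkQueue.take is an untimed wait, which typically indicates that no task is queued any more. A minimal sketch of a check for this, assuming the thread dump is available as plain text; the class and method names are hypothetical and not part of Jira, and the result is only a heuristic:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jira: reads the state of the reindex thread from a thread dump.
class ReindexThreadStateCheck {

    // Matches the thread's header line and captures the state reported on the next line.
    private static final Pattern STATE = Pattern.compile(
            "^\"NodeReindexServiceThread:thread-1\".*\\R\\s*java\\.lang\\.Thread\\.State: (\\w+)",
            Pattern.MULTILINE);

    /** Returns the Thread.State name of the reindex thread, if the thread appears in the dump. */
    static Optional<String> reindexThreadState(String threadDump) {
        Matcher m = STATE.matcher(threadDump);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Heuristic: WAITING (untimed) suggests the thread is no longer polling for messages. */
    static boolean looksStalled(String threadDump) {
        return reindexThreadState(threadDump).map("WAITING"::equals).orElse(false);
    }

    public static void main(String[] args) {
        String observed =
                "\"NodeReindexServiceThread:thread-1\" #110 prio=5 os_prio=0 waiting on condition\n"
              + "   java.lang.Thread.State: WAITING (parking)\n"
              + "\tat sun.misc.Unsafe.park(Native Method)\n";
        System.out.println("state:   " + reindexThreadState(observed).orElse("not found"));
        System.out.println("stalled: " + looksStalled(observed));
    }
}
```

A single dump is not conclusive on its own; comparing several dumps taken a few seconds apart gives a more reliable picture.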
Perceived results:
- The health check detects that the index data is behind the database and reports a delay that keeps increasing, because NodeReindexServiceThread:thread-1 is not picking up the messages.
Notes
As part of the Lucene changes in Jira 8.x, the code in this area was improved, which makes the problem less likely to occur; e.g. com.atlassian.jira.index.ha.DefaultNodeReindexService#reIndex now catches Throwable.
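One plausible mechanism, consistent with the stack traces and with the note above but not confirmed as the root cause in this report, is the documented behaviour of java.util.concurrent.ScheduledExecutorService: if any execution of a periodic task throws, subsequent executions are suppressed. The sketch below is a standalone illustration, not Jira code; the class and method names are hypothetical. It compares an unguarded periodic task with one that catches Throwable:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo, not Jira code: shows how one uncaught exception stops a periodic task.
class PeriodicTaskSuppressionDemo {

    /** Schedules a task that fails on its second run and returns how often it was run. */
    static int countRuns(boolean catchThrowable) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger runs = new AtomicInteger();

        // Fails exactly once, on the second execution.
        Runnable failing = () -> {
            if (runs.incrementAndGet() == 2) {
                throw new IllegalStateException("simulated failure");
            }
        };
        // Same task, but nothing escapes to the executor, so it keeps being rescheduled.
        Runnable guarded = () -> {
            try {
                failing.run();
            } catch (Throwable t) {
                // A real service would log the error here.
            }
        };

        executor.scheduleWithFixedDelay(catchThrowable ? guarded : failing, 0, 10, TimeUnit.MILLISECONDS);
        try {
            Thread.sleep(500);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            executor.shutdownNow();
        }
        return runs.get();
    }

    public static void main(String[] args) {
        System.out.println("unguarded runs: " + countRuns(false));
        System.out.println("guarded runs:   " + countRuns(true));
    }
}
```

The unguarded task stops after its second run, leaving the worker thread with nothing to wait for, whereas the guarded task keeps running for the whole interval.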
Workaround:
- Stop Jira on all nodes at the same time.
- Start the first node. Wait until the snapshot is restored, then restart the node.
- After the restart, wait until NodeReindexServiceThread:thread-1 has updated the index, then start the second node.
- If there are more nodes, start them one at a time.