Jira Data Center / JRASERVER-70443

NodeReindexServiceThread can stop checking messages


      Problem:

      NodeReindexServiceThread can enter a state in which it no longer checks for replication messages. This can cause index inconsistency between Jira Data Center nodes.

      Customer situation:

      In a standby environment, during cutover, the index snapshot was restored, but NodeReindexServiceThread was not checking messages, causing the nodes' indexes to fall behind.

      Environment

      • JIRA Data Center

      Expected Results

      NodeReindexServiceThread:thread-1 should be TIMED_WAITING in parkNanos, sleeping until it wakes up to check messages again:

      node2
      "NodeReindexServiceThread:thread-1" #128 prio=5 os_prio=0 tid=0x00007f8feb238800 nid=0x8379 waiting on condition [0x00007f8fd1bf9000]
         java.lang.Thread.State: TIMED_WAITING (parking)
      	at sun.misc.Unsafe.park(Native Method)
      	- parking to wait for  <0x00000006869304c0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
      	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
      	at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
      	at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
      	at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
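
      The TIMED_WAITING (parking) state above is what a healthy periodic polling task looks like: between runs, the scheduler's worker thread parks with a deadline in DelayedWorkQueue.take(). The following is an illustrative sketch, not Jira's actual scheduling code; the thread name and task body are placeholders:

```java
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class HealthyWorkerDemo {
    // Returns the state of the scheduler's worker thread while it sleeps
    // between periodic runs.
    static Thread.State workerState() throws InterruptedException {
        AtomicReference<Thread> worker = new AtomicReference<>();
        ScheduledThreadPoolExecutor pool = new ScheduledThreadPoolExecutor(1, r -> {
            Thread t = new Thread(r, "NodeReindexServiceThread:demo");
            worker.set(t);
            return t;
        });
        // A healthy periodic "check messages" task: it runs, then the worker
        // sleeps until the next scheduled run.
        pool.scheduleWithFixedDelay(() -> { /* poll for index messages */ },
                0, 10, TimeUnit.SECONDS);
        Thread.sleep(300); // give the first run time to finish
        Thread.State state = worker.get().getState();
        pool.shutdownNow();
        return state;
    }

    public static void main(String[] args) throws Exception {
        // Between runs the worker parks with a deadline in
        // DelayedWorkQueue.take() -> awaitNanos(), so a thread dump shows
        // TIMED_WAITING (parking), as in the node2 trace above.
        System.out.println(workerState()); // TIMED_WAITING
    }
}
```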
      

      What was observed instead:

      • Thread dumps showed that NodeReindexServiceThread:thread-1 was WAITING in an untimed park, i.e. blocked indefinitely with no scheduled wake-up:
      node1
      "NodeReindexServiceThread:thread-1" #110 prio=5 os_prio=0 tid=0x00007ff06f215800 nid=0x18f46 waiting on condition [0x00007ff04b429000]
         java.lang.Thread.State: WAITING (parking)
      	at sun.misc.Unsafe.park(Native Method)
      	- parking to wait for  <0x0000000686202370> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
      	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
      	at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1081)
      	at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
      	at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
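
      One way a ScheduledThreadPoolExecutor worker can end up in this untimed park is when its only periodic task is silently dropped (for example, because an exception escaped the task body), leaving the delay queue empty forever. The sketch below reproduces that state under those assumptions; the thread name is a placeholder, not Jira's actual code:

```java
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class StuckWorkerDemo {
    // Returns the worker thread's state after its only periodic task has
    // been cancelled by an escaped exception.
    static Thread.State workerState() throws InterruptedException {
        AtomicReference<Thread> worker = new AtomicReference<>();
        ScheduledThreadPoolExecutor pool = new ScheduledThreadPoolExecutor(1, r -> {
            Thread t = new Thread(r, "NodeReindexServiceThread:demo");
            worker.set(t);
            return t;
        });
        // A periodic task whose body throws: scheduleWithFixedDelay suppresses
        // all subsequent executions after the first failure, so the delay
        // queue becomes permanently empty.
        pool.scheduleWithFixedDelay(() -> { throw new RuntimeException("boom"); },
                0, 50, TimeUnit.MILLISECONDS);
        Thread.sleep(300); // let the first (and only) run fail
        // With nothing left to schedule, the core worker blocks without a
        // timeout in DelayedWorkQueue.take() -> LockSupport.park(), i.e.
        // Thread.State.WAITING, matching the node1 dump.
        Thread.State state = worker.get().getState();
        pool.shutdownNow();
        return state;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(workerState()); // WAITING
    }
}
```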
      

      Perceived results:

      • The Health Check detects that the index data is behind the database date and reports a delay that keeps increasing, because NodeReindexServiceThread:thread-1 is not processing messages.

      Notes

      As part of the Lucene changes in Jira 8.x, the code in this area was improved, making the problem less likely to occur; e.g. com.atlassian.jira.index.ha.DefaultNodeReindexService#reIndex now catches Throwable.
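
      Catching Throwable inside the task body matters because scheduleWithFixedDelay silently suppresses all subsequent executions once an execution lets an exception escape. The following sketch contrasts the two behaviors; it is illustrative only and does not reproduce DefaultNodeReindexService itself:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ReindexRecoveryDemo {
    // Returns {fragileRuns, robustRuns} after letting both tasks run briefly.
    static int[] runDemo() throws InterruptedException {
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger fragileRuns = new AtomicInteger();
        AtomicInteger robustRuns = new AtomicInteger();

        // Fragile task: lets the exception escape. scheduleWithFixedDelay
        // suppresses all further executions after the first throw, so this
        // counter never advances past 1.
        pool.scheduleWithFixedDelay(() -> {
            fragileRuns.incrementAndGet();
            throw new RuntimeException("simulated reindex failure");
        }, 0, 20, TimeUnit.MILLISECONDS);

        // Robust task: catches Throwable inside the task body, so the
        // periodic schedule survives the failure and keeps polling.
        pool.scheduleWithFixedDelay(() -> {
            try {
                robustRuns.incrementAndGet();
                throw new RuntimeException("simulated reindex failure");
            } catch (Throwable t) {
                // log and carry on; the schedule stays alive
            }
        }, 0, 20, TimeUnit.MILLISECONDS);

        Thread.sleep(300);
        pool.shutdownNow();
        return new int[] { fragileRuns.get(), robustRuns.get() };
    }

    public static void main(String[] args) throws Exception {
        int[] runs = runDemo();
        System.out.println("fragile runs: " + runs[0]); // 1 (schedule cancelled)
        System.out.println("robust runs:  " + runs[1]); // keeps incrementing
    }
}
```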

      Workaround:

      • Stop Jira on all nodes at the same time.
      • Start the first node. Wait until the snapshot is restored, then restart the node.
      • After the restart, once NodeReindexServiceThread:thread-1 has updated the index, start the second node.
      • If there are more nodes, start them sequentially.

              mswinarski Maciej Swinarski (Inactive)
              imurakami@atlassian.com Murakami
              Votes: 9
              Watchers: 24