  Jira Data Center / JRASERVER-63137

JVM instability at one node affects whole JIRA datacenter cluster


Details

    • 6.04
    • 27
    • Severity 1 - Critical
    • 265
      Atlassian Update – 8th December 2017

      Hi all,

      Thank you for watching and commenting on this issue. I wanted to let everyone know that we have addressed the problem with cluster cache replication, which has been identified as the main contributor to situations where one node's instability spreads to other nodes in the cluster.

      We have released the fix in Jira Software 7.2.12 and Jira Software 7.6.1. We are also planning to release this fix soon in bugfix releases of Jira 7.3, 7.4, and 7.5.

      We realize that there might be other contributors to node instability. Therefore, we would like to reassure you that node instability remains on our radar, and we will keep reacting to any signals of this happening.

      Meanwhile, we are closing this bug as resolved.

      Thank you for your patience.

      Best regards 

      Gosia Kowalska
      Senior Product Manager, Jira Software Server
       


    Description

      Summary

      JVM instability at one node affects the whole JIRA Data Center cluster. It is possible for an OutOfMemoryError (OOME) on a single node to bring the entire cluster down.

      Environment

      • JIRA Data Center with multiple nodes.
      • Node A is hammered by memory-intensive operations until it gets into an OOME state. At this point the node is not technically down and is still registered as an 'Active' member of the cluster, but it is not processing requests other than garbage collection.
      • Node B still considers Node A 'Active', so it keeps performing cache synchronisation to Node A. Node A does not respond to these requests, which leaves Node B in a stale state.

      Diagnostics

      Node B's thread dump shows many threads performing cache synchronisation:

      "RMI TCP Connection(401971)-10.50.226.97" #1309319 daemon prio=5 os_prio=0 tid=0x00007f0f2c048000 nid=0x1447b runnable [0x00007f0ee8338000] java.lang.Thread.State: RUNNABLE
       at java.net.SocketInputStream.socketRead0(Native Method)
       at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
       at java.net.SocketInputStream.read(SocketInputStream.java:170)
       at java.net.SocketInputStream.read(SocketInputStream.java:141)
       at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
       at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
       - locked <0x000000051a686650> (a java.io.BufferedInputStream)
       at java.io.FilterInputStream.read(FilterInputStream.java:83)
       at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:550)
       at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:826)
       at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$79(TCPTransport.java:683)
       at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$$Lambda$3/349671394.run(Unknown Source)
       at java.security.AccessController.doPrivileged(Native Method)
       at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:682)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
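
      For context, the stuck frames above show the default RMI transport blocking on a socket read with no SO_TIMEOUT, so the calling thread on Node B waits indefinitely for a node that will never answer. The sketch below is a generic illustration of how such calls can be bounded with an RMIClientSocketFactory that applies connect and read timeouts. It is not Jira's actual cache-replication wiring; the class name and timeout values are illustrative only.

      import java.io.IOException;
      import java.io.Serializable;
      import java.net.InetSocketAddress;
      import java.net.Socket;
      import java.rmi.server.RMIClientSocketFactory;

      // Hypothetical client socket factory: callers using it fail fast instead of
      // hanging forever in SocketInputStream.socketRead0 when the remote node is stuck.
      public class TimeoutClientSocketFactory implements RMIClientSocketFactory, Serializable {

          private static final int CONNECT_TIMEOUT_MS = 5_000;  // give up quickly if the node is unreachable
          private static final int READ_TIMEOUT_MS    = 30_000; // bound the wait on a stalled node

          @Override
          public Socket createSocket(String host, int port) throws IOException {
              Socket socket = new Socket();
              socket.connect(new InetSocketAddress(host, port), CONNECT_TIMEOUT_MS);
              // Without SO_TIMEOUT, reads block indefinitely -- exactly what the
              // "RMI TCP Connection" threads in the dump above are doing.
              socket.setSoTimeout(READ_TIMEOUT_MS);
              return socket;
          }
      }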
      

      Expected 

      When a node goes into an OOME state or another heavy GC loop (and hence stops processing requests), it should be evicted from the cluster so that the other nodes are not affected.

      Actual

      Node B can hang while its threads wait on cache replication calls to the unresponsive node.

      Workaround

      Restart or shut down the node that is hitting OOME.
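
      To apply this workaround automatically, the JVM on each node can be configured to exit as soon as an OutOfMemoryError is thrown, so the failing node drops out of the cluster instead of lingering in a GC loop. The snippet below is an example only: the setenv.sh location and the JVM_SUPPORT_RECOMMENDED_ARGS variable are assumptions that may vary by Jira version, and -XX:+ExitOnOutOfMemoryError requires JDK 8u92 or later.

      # Example only: append to the Jira node's JVM arguments (e.g. in bin/setenv.sh)
      JVM_SUPPORT_RECOMMENDED_ARGS="${JVM_SUPPORT_RECOMMENDED_ARGS} -XX:+ExitOnOutOfMemoryError"
      # On older JDKs, a similar effect can be achieved with:
      #   -XX:OnOutOfMemoryError="kill -9 %p"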

       


            People

              Assignee: Unassigned
              Reporter: vkharisma (Inactive)
              Votes: 39
              Watchers: 66
