  1. Jira Data Center
  2. JRASERVER-66237

ehcache.listener.socketTimeoutMillis is not used during Naming.lookup of CachePeer

Details

    Description

      Summary

      In Jira Data Center, during cache replication the value of ehcache.listener.socketTimeoutMillis from cluster.properties (or its default value) should be used as the read timeout for remote RMI calls to other nodes in the cluster. Instead, no read timeout is applied, so these calls can block indefinitely. As a result, communication problems with a single node can bring the entire cluster down.
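
      To illustrate what "used as the read timeout" means at the socket level, here is a minimal sketch of an RMI client socket factory that applies a timeout to every socket it creates. The class name and the helper method are hypothetical and this is not the Jira or Ehcache implementation; the report also does not say which socket factory the registry lookup path uses, so how such a factory would be wired in is left open.

```java
import java.io.IOException;
import java.io.Serializable;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.rmi.server.RMIClientSocketFactory;

// Hypothetical class name; this is not the Jira or Ehcache implementation.
class TimeoutClientSocketFactory implements RMIClientSocketFactory, Serializable {
    private static final long serialVersionUID = 1L;

    // Assumption: in a real setup this would come from
    // ehcache.listener.socketTimeoutMillis.
    private final int timeoutMillis;

    TimeoutClientSocketFactory(int timeoutMillis) {
        if (timeoutMillis <= 0) {
            throw new IllegalArgumentException("timeoutMillis must be positive");
        }
        this.timeoutMillis = timeoutMillis;
    }

    @Override
    public Socket createSocket(String host, int port) throws IOException {
        Socket socket = new Socket();
        // Bound every blocking read on this socket; 0 would mean "wait forever".
        socket.setSoTimeout(timeoutMillis);
        // Bound the TCP connect as well.
        socket.connect(new InetSocketAddress(host, port), timeoutMillis);
        return socket;
    }

    // RMI compares socket factories when deciding whether resources can be shared.
    @Override
    public boolean equals(Object other) {
        return other instanceof TimeoutClientSocketFactory
                && ((TimeoutClientSocketFactory) other).timeoutMillis == timeoutMillis;
    }

    @Override
    public int hashCode() {
        return Integer.hashCode(timeoutMillis);
    }

    /** Returns true if reading from a peer that never answers fails with a timeout. */
    static boolean readTimesOutAgainstSilentPeer(int timeoutMillis) throws IOException {
        // This server socket never calls accept(): the OS completes the TCP
        // handshake, but no application data is ever sent back.
        try (ServerSocket silentPeer = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket socket = new TimeoutClientSocketFactory(timeoutMillis).createSocket(
                     silentPeer.getInetAddress().getHostAddress(), silentPeer.getLocalPort())) {
            socket.getInputStream().read(); // blocks until data arrives or the timeout fires
            return false;
        } catch (SocketTimeoutException expected) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOutAgainstSilentPeer(200));
    }
}
```

      The helper simulates an unresponsive peer with a listening socket that never answers. With a timeout set, the read fails with SocketTimeoutException; without one, the same read would block for as long as the peer stays silent.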

      Environment

      • Jira Data Center with multiple nodes
        Node A is unresponsive because of extremely high load, high memory pressure, or some other condition. In this state the node is not technically down: it is still registered as an 'Active' member of the cluster, but it is not processing requests.
        Node B still considers Node A 'Active', so it keeps sending cache synchronisation requests to Node A. Node A does not respond to them, which leaves Node B stalled.

      Diagnostic 

      Node B's thread dump shows many threads performing cache synchronisation. These threads hang in java.rmi.Naming.lookup():

       java.lang.Thread.State: RUNNABLE
              at java.net.SocketInputStream.socketRead0(Native Method)
              at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
              at java.net.SocketInputStream.read(SocketInputStream.java:170)
              at java.net.SocketInputStream.read(SocketInputStream.java:141)
              at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
              at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
              - locked <0x00000004b8012be8> (a java.io.BufferedInputStream)
              at sun.rmi.transport.tcp.TCPConnection.isDead(TCPConnection.java:192)
              at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:191)
              at sun.rmi.server.UnicastRef.newCall(UnicastRef.java:342)
              at sun.rmi.registry.RegistryImpl_Stub.lookup(Unknown Source)
              at java.rmi.Naming.lookup(Naming.java:101)
              at net.sf.ehcache.distribution.RMICacheManagerPeerProvider.lookupRemoteCachePeer(RMICacheManagerPeerProvider.java:127)
              at com.atlassian.jira.cluster.distribution.JiraCacheManagerPeerProvider.listRemoteCachePeers(JiraCacheManagerPeerProvider.java:93)
              at net.sf.ehcache.distribution.RMISynchronousCacheReplicator.listRemoteCachePeers(RMISynchronousCacheReplicator.java:335)
              at net.sf.ehcache.distribution.RMISynchronousCacheReplicator.replicateRemovalNotification(RMISynchronousCacheReplicator.java:239)
              at net.sf.ehcache.distribution.RMISynchronousCacheReplicator.notifyElementRemoved(RMISynchronousCacheReplicator.java:229)
              at net.sf.ehcache.event.RegisteredEventListeners.internalNotifyElementRemoved(RegisteredEventListeners.java:144)
              at net.sf.ehcache.event.RegisteredEventListeners.notifyElementRemoved(RegisteredEventListeners.java:124)
              at net.sf.ehcache.Cache.notifyRemoveInternalListeners(Cache.java:2322)
              at net.sf.ehcache.Cache.removeInternal(Cache.java:2305)
              at net.sf.ehcache.Cache.remove(Cache.java:2207)
              at net.sf.ehcache.Cache.remove(Cache.java:2125)
              at net.sf.ehcache.constructs.EhcacheDecoratorAdapter.remove(EhcacheDecoratorAdapter.java:154)
              at com.atlassian.cache.ehcache.LoadingCache.remove(LoadingCache.java:208)
              at com.atlassian.cache.ehcache.DelegatingCache.remove(DelegatingCache.java:134)
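
      To check how many threads on a node are in this state, one option is to scan a saved thread dump for the java.rmi.Naming.lookup frame. The sketch below is a hypothetical helper, not an Atlassian tool. It assumes a jstack-style text dump in which thread entries are separated by blank lines.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for triaging a saved thread dump.
class StuckLookupScanner {
    // The frame that appears in the trace from this report.
    private static final String LOOKUP_FRAME = "at java.rmi.Naming.lookup(";

    /** Counts thread entries whose stack contains a java.rmi.Naming.lookup frame. */
    static int countThreadsInNamingLookup(String threadDump) {
        int count = 0;
        // Assumption: entries are separated by one or more blank lines.
        for (String entry : threadDump.split("\\R\\s*\\R")) {
            if (entry.contains(LOOKUP_FRAME)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java StuckLookupScanner <path-to-thread-dump.txt>
        String dump = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        System.out.println(countThreadsInNamingLookup(dump)
                + " thread(s) in java.rmi.Naming.lookup()");
    }
}
```

      A count that stays high or keeps growing across several dumps taken a few seconds apart would be consistent with the hang described here, whereas a single dump only shows a snapshot.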
      

      Expected 

      Requests performing java.rmi.Naming.lookup() throw an exception once the configured timeout elapses.
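
      The expected behaviour, from the caller's point of view, can be sketched by bounding the lookup with a Future. This is a different technique from the socket-level read timeout the report asks for, and it is not how Jira or Ehcache implement anything. The class and method names are hypothetical. The caller gets a TimeoutException after the given time, but a worker thread blocked in a socket read generally does not react to interruption, so it may linger until the socket itself gives up.

```java
import java.rmi.Naming;
import java.rmi.Remote;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical wrapper; illustrates the caller-visible behaviour only.
class BoundedLookup {
    // Daemon threads, so a stuck lookup cannot keep the JVM from exiting.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(runnable -> {
        Thread thread = new Thread(runnable, "bounded-lookup");
        thread.setDaemon(true);
        return thread;
    });

    /** Runs the call and throws TimeoutException if it has not finished in time. */
    static <T> T callWithTimeout(Callable<T> call, long timeoutMillis) throws Exception {
        Future<T> future = POOL.submit(call);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Best effort: the interrupt may be ignored by a blocked socket read.
            future.cancel(true);
            throw e;
        } catch (ExecutionException e) {
            // Surface the original failure (for example a RemoteException).
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e;
        }
    }

    /** Naming.lookup bounded by a caller-side timeout. */
    static Remote lookup(String rmiUrl, long timeoutMillis) throws Exception {
        return callWithTimeout(() -> Naming.lookup(rmiUrl), timeoutMillis);
    }

    public static void main(String[] args) throws Exception {
        // Usage: java BoundedLookup <rmi-url> <timeout-millis>
        System.out.println(lookup(args[0], Long.parseLong(args[1])));
    }
}
```

      Because the blocked thread is not reclaimed, this kind of wrapper only keeps the calling thread responsive. A timeout on the underlying socket is still needed to release the resources held by the stuck call.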

      Actual

      Requests performing java.rmi.Naming.lookup() hang indefinitely, stalling Node B.

      Workaround

      Restart or shut down the unresponsive node.

            People

              Assignee: Unassigned
              Reporter: Pawel Bugalski (pbugalski, Inactive)
              Votes: 0
              Watchers: 7