Jira Data Center / JRASERVER-66237

ehcache.listener.socketTimeoutMillis is not used during Naming.lookup of CachePeer

      Summary

      In Jira Data Center, during cache replication the value of ehcache.listener.socketTimeoutMillis from cluster.properties (or its default value) should be used as the read timeout for remote RMI calls to other nodes in the cluster. Instead, an infinite timeout is used during the Naming.lookup of the remote CachePeer, so communication problems with a single node can bring the entire cluster down.
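To illustrate what "using the value as a read timeout" could look like for a registry lookup, here is a minimal sketch in plain JDK RMI. It is not the Ehcache or Jira implementation and not necessarily how this issue was fixed. TimeoutClientSocketFactory and lookupWithTimeout are hypothetical names, and timeoutMillis stands in for the value of ehcache.listener.socketTimeoutMillis. The idea is to obtain the registry stub through a client socket factory that sets SO_TIMEOUT, rather than through java.rmi.Naming.lookup, which uses the default socket factory.

```java
import java.io.IOException;
import java.io.Serializable;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.rmi.NotBoundException;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.RMIClientSocketFactory;

/** Hypothetical factory: every socket it creates has a bounded connect and read timeout. */
final class TimeoutClientSocketFactory implements RMIClientSocketFactory, Serializable {
    private static final long serialVersionUID = 1L;

    // Assumption: this would be fed from ehcache.listener.socketTimeoutMillis.
    private final int timeoutMillis;

    TimeoutClientSocketFactory(int timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    @Override
    public Socket createSocket(String host, int port) throws IOException {
        Socket socket = new Socket();
        try {
            socket.setSoTimeout(timeoutMillis); // bound blocking reads
            socket.connect(new InetSocketAddress(host, port), timeoutMillis); // bound connect
            return socket;
        } catch (IOException e) {
            socket.close();
            throw e;
        }
    }

    // Equal factories let the RMI runtime treat their endpoints as the same endpoint.
    @Override
    public boolean equals(Object other) {
        return other instanceof TimeoutClientSocketFactory
                && ((TimeoutClientSocketFactory) other).timeoutMillis == timeoutMillis;
    }

    @Override
    public int hashCode() {
        return timeoutMillis;
    }

    /** Looks up a remote object through a registry stub that uses the timeout-aware factory. */
    static Remote lookupWithTimeout(String host, int port, String name, int timeoutMillis)
            throws RemoteException, NotBoundException {
        Registry registry =
                LocateRegistry.getRegistry(host, port, new TimeoutClientSocketFactory(timeoutMillis));
        return registry.lookup(name);
    }
}
```

With a factory like this, a read against a silent peer would be expected to fail with an exception wrapping a java.net.SocketTimeoutException instead of blocking forever. The JDK's RMI transport may still apply its own handshake timeout while a new connection is being set up, so the effective bound is not guaranteed to equal timeoutMillis in every phase.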

      Environment

      • Jira Data Center with multiple nodes
        Node A is unresponsive because of extremely high load, high memory pressure, or some other condition. In this state the node is not technically down: it is still registered as an 'Active' member of the cluster, but it is not processing requests.
        Node B still considers Node A 'Active', so it keeps sending cache synchronisation requests to Node A. Node A does not respond, which leaves Node B stalled.

      Diagnostic 

      Node B's thread dump shows many threads performing cache synchronisation. These threads hang in java.rmi.Naming.lookup():

       java.lang.Thread.State: RUNNABLE
              at java.net.SocketInputStream.socketRead0(Native Method)
              at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
              at java.net.SocketInputStream.read(SocketInputStream.java:170)
              at java.net.SocketInputStream.read(SocketInputStream.java:141)
              at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
              at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
              - locked <0x00000004b8012be8> (a java.io.BufferedInputStream)
              at sun.rmi.transport.tcp.TCPConnection.isDead(TCPConnection.java:192)
              at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:191)
              at sun.rmi.server.UnicastRef.newCall(UnicastRef.java:342)
              at sun.rmi.registry.RegistryImpl_Stub.lookup(Unknown Source)
              at java.rmi.Naming.lookup(Naming.java:101)
              at net.sf.ehcache.distribution.RMICacheManagerPeerProvider.lookupRemoteCachePeer(RMICacheManagerPeerProvider.java:127)
              at com.atlassian.jira.cluster.distribution.JiraCacheManagerPeerProvider.listRemoteCachePeers(JiraCacheManagerPeerProvider.java:93)
              at net.sf.ehcache.distribution.RMISynchronousCacheReplicator.listRemoteCachePeers(RMISynchronousCacheReplicator.java:335)
              at net.sf.ehcache.distribution.RMISynchronousCacheReplicator.replicateRemovalNotification(RMISynchronousCacheReplicator.java:239)
              at net.sf.ehcache.distribution.RMISynchronousCacheReplicator.notifyElementRemoved(RMISynchronousCacheReplicator.java:229)
              at net.sf.ehcache.event.RegisteredEventListeners.internalNotifyElementRemoved(RegisteredEventListeners.java:144)
              at net.sf.ehcache.event.RegisteredEventListeners.notifyElementRemoved(RegisteredEventListeners.java:124)
              at net.sf.ehcache.Cache.notifyRemoveInternalListeners(Cache.java:2322)
              at net.sf.ehcache.Cache.removeInternal(Cache.java:2305)
              at net.sf.ehcache.Cache.remove(Cache.java:2207)
              at net.sf.ehcache.Cache.remove(Cache.java:2125)
              at net.sf.ehcache.constructs.EhcacheDecoratorAdapter.remove(EhcacheDecoratorAdapter.java:154)
              at com.atlassian.cache.ehcache.LoadingCache.remove(LoadingCache.java:208)
              at com.atlassian.cache.ehcache.DelegatingCache.remove(DelegatingCache.java:134)
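To check how many threads in a dump match the pattern above, a small helper along these lines could be used. This is only a sketch: StuckLookupCounter is a hypothetical name, and it assumes the HotSpot text format in which thread entries are separated by blank lines. It counts entries that contain both java.rmi.Naming.lookup and a blocking SocketInputStream.socketRead0 frame.

```java
import java.util.Arrays;

final class StuckLookupCounter {
    /**
     * Counts thread entries whose stack contains a socket read
     * together with java.rmi.Naming.lookup.
     */
    static long countStuckLookups(String threadDump) {
        // Assumption: entries in the dump are separated by one or more blank lines.
        return Arrays.stream(threadDump.split("\\R\\s*\\R"))
                .filter(entry -> entry.contains("java.rmi.Naming.lookup"))
                .filter(entry -> entry.contains("java.net.SocketInputStream.socketRead0"))
                .count();
    }
}
```

A count that stays high across several dumps taken a few seconds apart suggests the threads are blocked rather than merely busy.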
      

      Expected 

      Requests performing java.rmi.Naming.lookup() should throw an exception once the configured timeout elapses.
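The expected behaviour can be demonstrated at the socket level, independent of RMI and Ehcache. The sketch below uses a hypothetical ReadTimeoutDemo class in which a local ServerSocket that never writes stands in for the unresponsive node. With SO_TIMEOUT set, the blocking read fails with SocketTimeoutException. With a timeout of 0, the same read would block indefinitely, which corresponds to the socketRead0 frame at the top of the stack trace.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

final class ReadTimeoutDemo {
    /** Returns true if a read from a silent peer fails with SocketTimeoutException. */
    static boolean readTimesOut(int timeoutMillis) throws IOException {
        // Stand-in for the unresponsive node: the OS completes the TCP handshake,
        // but the peer never sends any data.
        try (ServerSocket silentPeer = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket()) {
            client.connect(
                    new InetSocketAddress(silentPeer.getInetAddress(), silentPeer.getLocalPort()),
                    timeoutMillis);
            client.setSoTimeout(timeoutMillis); // a value of 0 would mean "wait forever"
            try {
                client.getInputStream().read(); // blocks until data arrives or the timeout fires
                return false;
            } catch (SocketTimeoutException expected) {
                return true;
            }
        }
    }
}
```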

      Actual

      Requests performing java.rmi.Naming.lookup() hang indefinitely, stalling Node B.

      Workaround

      Restart or shut down the unresponsive node.


            Bugfix Automation Bot made changes -
            Minimum Version New: 7.02
            Owen made changes -
            Workflow Original: JAC Bug Workflow v2 [ 2847768 ] New: JAC Bug Workflow v3 [ 2929684 ]
            Status Original: Resolved [ 5 ] New: Closed [ 6 ]
            Owen made changes -
            Symptom Severity Original: Critical [ 14430 ] New: Severity 1 - Critical [ 15830 ]
            Owen made changes -
            Workflow Original: JIRA Bug Workflow w Kanban v7 - Restricted [ 2589489 ] New: JAC Bug Workflow v2 [ 2847768 ]
            Ignat (Inactive) made changes -
            Workflow Original: JIRA Bug Workflow w Kanban v6 - Restricted [ 2471410 ] New: JIRA Bug Workflow w Kanban v7 - Restricted [ 2589489 ]
            Karol Lopacinski made changes -
            Remote Link New: This issue links to "Page (Extranet)" [ 335320 ]
            Andriy Yakovlev [Atlassian] made changes -
            Link New: This issue relates to JRASERVER-66369 [ JRASERVER-66369 ]
            Karol Lopacinski made changes -
            Remote Link New: This issue links to "Page (Extranet)" [ 335001 ]
            Piotr Ackermann (Inactive) made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Soaking [ 10041 ] New: Resolved [ 5 ]

              Assignee: Unassigned
              Reporter: Pawel Bugalski (pbugalski, Inactive)