- Bug
- Resolution: Fixed
- Low
- 7.5.1, 7.2.11
- 7.02
- 1
- Severity 1 - Critical
- 2
Summary
In Jira Data Center, during cache replication, the value of ehcache.listener.socketTimeoutMillis from cluster.properties (or its default value) should be used as the read timeout for remote RMI calls to other nodes in the cluster. Instead, an infinite timeout is used, so communication problems with a single node can bring the entire cluster down.
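As an illustration of the mechanism described in the summary, the sketch below shows one generic way to give RMI client sockets a finite read timeout using the standard java.rmi.server.RMISocketFactory API. It is not the fix that was shipped for this issue; the class name TimeoutRmiSocketFactory and the timeout value passed to it are assumptions made for the example.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.rmi.server.RMISocketFactory;

// Hypothetical helper (not part of Jira or Ehcache): every client socket it
// creates gets a finite connect timeout and a finite read timeout.
final class TimeoutRmiSocketFactory extends RMISocketFactory {
    private final int timeoutMillis;

    TimeoutRmiSocketFactory(int timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    @Override
    public Socket createSocket(String host, int port) throws IOException {
        Socket socket = new Socket();
        // A read timeout of 0 means "block forever"; any positive value makes
        // a blocked read fail with SocketTimeoutException instead.
        socket.setSoTimeout(timeoutMillis);
        socket.connect(new InetSocketAddress(host, port), timeoutMillis);
        return socket;
    }

    @Override
    public ServerSocket createServerSocket(int port) throws IOException {
        // Server side is left at the JDK default behaviour.
        return new ServerSocket(port);
    }
}
```

A factory like this would be installed with RMISocketFactory.setSocketFactory(new TimeoutRmiSocketFactory(timeoutMillis)), which the JDK allows only once per JVM. Whether that is appropriate inside a Jira node is a separate question that this sketch does not answer.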
Environment
- JIRA Data Center with multiple nodes
Node A is unresponsive because of extremely high load, high memory pressure, or any other condition that prevents it from responding. In this state the node is not technically down: it is still registered as an 'Active' member of the cluster, but it is not processing requests.
Node B still considers Node A 'Active', so it keeps attempting cache synchronisation with Node A. Node A does not respond to these requests, which leaves Node B stalled.
Diagnostic
Node B's thread dump shows many threads performing cache synchronisation. Those threads tend to hang in java.rmi.Naming.lookup():
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
- locked <0x00000004b8012be8> (a java.io.BufferedInputStream)
at sun.rmi.transport.tcp.TCPConnection.isDead(TCPConnection.java:192)
at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:191)
at sun.rmi.server.UnicastRef.newCall(UnicastRef.java:342)
at sun.rmi.registry.RegistryImpl_Stub.lookup(Unknown Source)
at java.rmi.Naming.lookup(Naming.java:101)
at net.sf.ehcache.distribution.RMICacheManagerPeerProvider.lookupRemoteCachePeer(RMICacheManagerPeerProvider.java:127)
at com.atlassian.jira.cluster.distribution.JiraCacheManagerPeerProvider.listRemoteCachePeers(JiraCacheManagerPeerProvider.java:93)
at net.sf.ehcache.distribution.RMISynchronousCacheReplicator.listRemoteCachePeers(RMISynchronousCacheReplicator.java:335)
at net.sf.ehcache.distribution.RMISynchronousCacheReplicator.replicateRemovalNotification(RMISynchronousCacheReplicator.java:239)
at net.sf.ehcache.distribution.RMISynchronousCacheReplicator.notifyElementRemoved(RMISynchronousCacheReplicator.java:229)
at net.sf.ehcache.event.RegisteredEventListeners.internalNotifyElementRemoved(RegisteredEventListeners.java:144)
at net.sf.ehcache.event.RegisteredEventListeners.notifyElementRemoved(RegisteredEventListeners.java:124)
at net.sf.ehcache.Cache.notifyRemoveInternalListeners(Cache.java:2322)
at net.sf.ehcache.Cache.removeInternal(Cache.java:2305)
at net.sf.ehcache.Cache.remove(Cache.java:2207)
at net.sf.ehcache.Cache.remove(Cache.java:2125)
at net.sf.ehcache.constructs.EhcacheDecoratorAdapter.remove(EhcacheDecoratorAdapter.java:154)
at com.atlassian.cache.ehcache.LoadingCache.remove(LoadingCache.java:208)
at com.atlassian.cache.ehcache.DelegatingCache.remove(DelegatingCache.java:134)
Expected
Requests performing java.rmi.Naming.lookup() operations throw an exception after the specified timeout.
Actual
Requests performing java.rmi.Naming.lookup() hang indefinitely, stalling Node B.
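The difference between the expected and actual behaviour can be reproduced outside Jira with plain sockets. The following is a minimal sketch, assuming a local server socket that accepts the TCP connection but never writes a byte as a stand-in for the unresponsive node. The class and method names are hypothetical. With a positive read timeout the blocked read is abandoned with SocketTimeoutException; with a timeout of 0 (infinite) the same read would block indefinitely, which matches the socketRead0 frames in the thread dump.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

final class ReadTimeoutSketch {
    /** Returns true if the read was abandoned with SocketTimeoutException. */
    static boolean readTimesOut(int timeoutMillis) throws IOException {
        // Stand-in for the unresponsive node: the OS completes the TCP
        // handshake via the listen backlog, but nothing is ever sent back.
        try (ServerSocket silentNode =
                     new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(silentNode.getInetAddress(),
                                        silentNode.getLocalPort())) {
            client.setSoTimeout(timeoutMillis); // 0 would mean "wait forever"
            try {
                client.getInputStream().read(); // blocks: no data will arrive
                return false;
            } catch (SocketTimeoutException e) {
                return true; // the expected behaviour from this report
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("timed out: " + readTimesOut(200));
    }
}
```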
Workaround
Restart or shut down the unresponsive node.
- causes
  - JRASERVER-63137 JVM instability at one node affects whole JIRA datacenter cluster (Closed)
  - JRASERVER-64267 Removing Data Center node breaks JIRA login for around 10 minutes (Closed)
- relates to
  - JRASERVER-66369 ehcache.listener.socketTimeoutMillis is not used for TCP/RMI handshakes (Closed)
- is cloned from
  - DELTA-135 You do not have permission to view this issue