Details
- Bug
- Resolution: Fixed
- Low
- 7.2.11, 7.5.3
- 7.02
- Severity 1 - Critical
Description
Summary
In Jira Data Center, cache replication should use the value of ehcache.listener.socketTimeoutMillis from cluster.properties (or its default value) as the read timeout for remote RMI calls to other nodes in the cluster. Instead, an infinite timeout is used, so communication problems with a single node can bring the entire cluster down.
Environment
- Jira Data Center with multiple nodes
Node A is unresponsive because of extremely high load, high memory pressure, or any other condition that stops it from responding. In this state the node is not technically down: it is still registered as an 'Active' member of the cluster, but it is not processing requests.
Node B still considers Node A 'Active', so it keeps trying to replicate cache changes to Node A. Node A never responds to these requests, which leaves Node B stalled as well.
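The scenario above comes down to a blocking read on a connection whose peer completed the TCP handshake but never answers. The following is a minimal Java sketch of that behaviour using plain sockets. It is not Jira or Ehcache code, and the class and method names (`ReadTimeoutDemo`, `readTimesOut`) are hypothetical. The "silent peer" is simulated by a listening socket that never calls `accept()`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class; it only illustrates socket read timeouts in general.
class ReadTimeoutDemo {

    /**
     * Connects to a peer that never replies and tries to read one byte.
     * Returns true if the read was aborted by the configured timeout.
     */
    static boolean readTimesOut(int readTimeoutMillis) throws IOException {
        // Port 0 picks a free port. The OS completes the TCP handshake from the
        // listen backlog even though accept() is never called, so the client
        // sees a peer that looks alive but stays silent.
        try (ServerSocket silentPeer = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(silentPeer.getInetAddress(), silentPeer.getLocalPort())) {
            // A value of 0 would mean "block forever", which is the failure mode described above.
            client.setSoTimeout(readTimeoutMillis);
            InputStream in = client.getInputStream();
            try {
                in.read();
                return false; // the peer answered or closed the connection
            } catch (SocketTimeoutException expected) {
                return true;  // the read gave up after readTimeoutMillis
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

With a finite timeout the caller gets an exception it can handle. With no timeout set, the same `read()` would block until the peer responds or the connection is reset.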
Symptoms
Expected behaviour
TCP and RMI handshakes will throw an exception after the specified timeout has passed.
Workarounds
Restart or gracefully shut down the unresponsive node.
Note on fix
The fix makes Ehcache replication use finite timeouts (5 s by default) for TCP and RMI handshakes.
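As an illustration of how finite timeouts can be applied to RMI client connections in general, here is a minimal sketch of a socket factory built on the standard `java.rmi.server.RMIClientSocketFactory` interface. It is not the actual patch and does not claim to mirror Ehcache's implementation. The class name `TimeoutClientSocketFactory` is hypothetical, and using one value for both the connect and read timeouts is an assumption made for brevity.

```java
import java.io.IOException;
import java.io.Serializable;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.rmi.server.RMIClientSocketFactory;

// Hypothetical factory: every socket it creates has a finite connect and read timeout.
class TimeoutClientSocketFactory implements RMIClientSocketFactory, Serializable {
    private static final long serialVersionUID = 1L;

    private final int timeoutMillis;

    TimeoutClientSocketFactory(int timeoutMillis) {
        if (timeoutMillis <= 0) {
            // 0 means "infinite" for both connect() and setSoTimeout(), so reject it.
            throw new IllegalArgumentException("timeout must be positive");
        }
        this.timeoutMillis = timeoutMillis;
    }

    @Override
    public Socket createSocket(String host, int port) throws IOException {
        Socket socket = new Socket();
        try {
            // Bound the TCP connect, then bound every subsequent blocking read.
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            socket.setSoTimeout(timeoutMillis);
            return socket;
        } catch (IOException e) {
            socket.close();
            throw e;
        }
    }

    // RMI compares socket factories with equals/hashCode when deciding whether to reuse connections.
    @Override
    public boolean equals(Object other) {
        return other instanceof TimeoutClientSocketFactory
                && ((TimeoutClientSocketFactory) other).timeoutMillis == timeoutMillis;
    }

    @Override
    public int hashCode() {
        return Integer.hashCode(timeoutMillis);
    }
}
```

A factory like this would typically be supplied when the remote object is exported, for example via `UnicastRemoteObject.exportObject(remote, port, clientFactory, serverFactory)`, so that stubs handed to other nodes carry the timeout settings with them.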
Issue Links
- is related to
  - JRASERVER-66237 ehcache.listener.socketTimeoutMillis is not used during Naming.lookup of CachePeer (Closed)
- relates to
  - JRASERVER-63137 JVM instability at one node affects whole JIRA datacenter cluster (Closed)
  - DELTA-127
  - DELTA-148