Details
- Type: Bug
- Resolution: Fixed
- Priority: Highest
- Fix Version/s: 6.4.13, 6.4.14, 7.2.8, 7.2.10
- 6.04
- 27
- Severity 1 - Critical
- 265
Description
Summary
JVM instability on one node affects the whole JIRA Data Center cluster. An OutOfMemoryError (OOME) on a single node can bring the entire cluster down.
Environment
- JIRA Data Center with multiple nodes
Node A is hammered by memory-intensive operations until it ends up in an OOME state. At this point the node is not technically down and is still registered as an 'Active' member of the cluster, but it is no longer processing requests; it is only running garbage collection.
Node B still considers node A 'Active', so it keeps performing cache synchronisation against node A. Node A does not respond to those requests, which leaves node B in a stale state.
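As an illustrative aside (not part of the original report): the hang on node B comes down to a TCP read with no effective read timeout, which the linked JRASERVER-66237 and JRASERVER-66369 suggest is the case on the RMI replication path. Below is a minimal, self-contained Java sketch of that mechanism; the class name and the local "unresponsive peer" are hypothetical stand-ins for node A, and the only point is that a socket read without SO_TIMEOUT blocks indefinitely while one with setSoTimeout() fails fast.

import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class BlockedReplicationDemo {
    public static void main(String[] args) throws Exception {
        // Stand-in for node A: accepts the connection and then never sends a byte,
        // much like a node that is still up at the TCP level but stuck in GC.
        ServerSocket unresponsivePeer = new ServerSocket(0);
        Thread peer = new Thread(() -> {
            try {
                unresponsivePeer.accept();       // accept, then do nothing forever
                Thread.sleep(Long.MAX_VALUE);
            } catch (Exception ignored) {
                // demo only
            }
        });
        peer.setDaemon(true);
        peer.start();

        // Stand-in for one of node B's replication threads.
        try (Socket client = new Socket("localhost", unresponsivePeer.getLocalPort())) {
            client.setSoTimeout(2000);           // without this, the read below would block forever
            InputStream in = client.getInputStream();
            in.read();                           // blocks in the native socket read (SocketInputStream.socketRead0 on Java 8)
            System.out.println("unexpected: peer sent data");
        } catch (SocketTimeoutException e) {
            System.out.println("read timed out instead of hanging forever: " + e);
        }
    }
}

The RMI cache-replication threads shown in the diagnostic below are stuck in exactly that read, but with no timeout to rescue them, so they never return.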
Diagnostic
Node B's thread dump shows many threads performing cache synchronisation:
"RMI TCP Connection(401971)-10.50.226.97" #1309319 daemon prio=5 os_prio=0 tid=0x00007f0f2c048000 nid=0x1447b runnable [0x00007f0ee8338000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) at java.net.SocketInputStream.read(SocketInputStream.java:170) at java.net.SocketInputStream.read(SocketInputStream.java:141) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read(BufferedInputStream.java:265) - locked <0x000000051a686650> (a java.io.BufferedInputStream) at java.io.FilterInputStream.read(FilterInputStream.java:83) at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:550) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:826) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$79(TCPTransport.java:683) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$$Lambda$3/349671394.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:682) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
Expected
When a node goes into an OOME state or another heavy GC loop (and therefore stops processing requests), it should be evicted from the cluster so that the other nodes are not affected.
Actual
Node B can hang while replicating caches to the unresponsive node A.
Workaround
Restart or shut down the node that is hitting OOME.
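One possible way to automate part of that workaround, offered purely as an assumption and not as anything documented in this ticket: HotSpot JVMs from JDK 8u92 onward support the -XX:+ExitOnOutOfMemoryError flag, which terminates the JVM on the first OOME so the node actually goes down instead of lingering as an 'Active' but unresponsive cluster member; an external service manager would still be needed to restart it. A tiny hypothetical program to observe the flag's behaviour:

import java.util.ArrayList;
import java.util.List;

// Hypothetical demo of -XX:+ExitOnOutOfMemoryError (HotSpot, JDK 8u92+):
//   java -Xmx64m OomDemo                               -> OOME is thrown and caught, JVM keeps running
//   java -Xmx64m -XX:+ExitOnOutOfMemoryError OomDemo   -> JVM exits on the first OOME
public class OomDemo {
    public static void main(String[] args) {
        List<byte[]> hoard = new ArrayList<>();
        try {
            while (true) {
                hoard.add(new byte[1024 * 1024]); // allocate until the heap is exhausted
            }
        } catch (OutOfMemoryError e) {
            // Only reached without the flag; with the flag the JVM has already exited.
            System.out.println("caught OOME, JVM still alive: " + e.getMessage());
        }
    }
}

This only addresses the lingering half-dead node; the actual fix for the replication hang is tracked by this issue and the linked tickets.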
Attachments
Issue Links
- is caused by
  - JRASERVER-66237 ehcache.listener.socketTimeoutMillis is not used during Naming.lookup of CachePeer (Closed)
- is related to
  - JRASERVER-66369 ehcache.listener.socketTimeoutMillis is not used for TCP/RMI handshakes (Closed)
  - JRASERVER-63556 Implement semi-async cache replication for EhCache in DC (Closed)
- relates to
  - JRASERVER-64267 Removing Data Center node breaks JIRA login for around 10 minutes (Closed)
  - JRASERVER-65977 [Data Center] Lock contention on ehcache in DefaultGlobalPermissionManager under high load with many users (Closed)
  - JRASERVER-68548 Cluster cache replication can cause high CPU across all nodes in the cluster and require a restart (Closed)
  - JRASERVER-66393 Do not replicate caches to node that has problems receiving cache replication requests (Closed)
  - PSR-43