  Jira Data Center / JRASERVER-63137

JVM instability at one node affects whole JIRA datacenter cluster


Details

    • 6.04
    • 27
    • Severity 1 - Critical
    • 265
      Atlassian Update – 8th December 2017

      Hi all,

      Thank you for watching and commenting on this issue. I wanted to let everyone know that we have addressed the problem with cluster cache replication, which has been identified as the main contributor to situations where one node's instability spreads to other nodes in the cluster.

      We have released the fix in Jira Software 7.2.12 and Jira Software 7.6.1. We are also planning to release this fix soon in bugfix releases of Jira 7.3, 7.4, and 7.5.

      We realize that there might be other contributors to node instability. Therefore, we would like to reassure you that node instability remains on our radar, and we will keep reacting to any signals of this happening.

      Meanwhile, we are closing this bug as resolved.

      Thank you for your patience.

      Best regards 

      Gosia Kowalska
      Senior Product Manager, Jira Software Server
       


    Description

      Summary

      JVM instability at one node affects the whole JIRA Data Center cluster. It is possible for an OutOfMemoryError (OOME) on a single node to bring the entire cluster down.

      Environment

      • JIRA Data Center with multiple nodes.
      • Node A is hammered by memory-intensive operations until it gets into an OOME state. At this point the node is not technically down and is still registered as an 'Active' member of the cluster, but it is not processing requests other than garbage collection.
      • Node B still considers Node A 'Active', so it keeps performing cache synchronisation to Node A. Node A does not respond to these requests, which leaves Node B in a stale state.

      Diagnostics

      Node B's thread dump shows many threads performing cache synchronisation:

      "RMI TCP Connection(401971)-10.50.226.97" #1309319 daemon prio=5 os_prio=0 tid=0x00007f0f2c048000 nid=0x1447b runnable [0x00007f0ee8338000] java.lang.Thread.State: RUNNABLE
       at java.net.SocketInputStream.socketRead0(Native Method)
       at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
       at java.net.SocketInputStream.read(SocketInputStream.java:170)
       at java.net.SocketInputStream.read(SocketInputStream.java:141)
       at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
       at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
       - locked <0x000000051a686650> (a java.io.BufferedInputStream)
       at java.io.FilterInputStream.read(FilterInputStream.java:83)
       at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:550)
       at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:826)
       at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$79(TCPTransport.java:683)
       at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$$Lambda$3/349671394.run(Unknown Source)
       at java.security.AccessController.doPrivileged(Native Method)
       at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:682)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
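
      For context, the stuck frames above show the default RMI transport blocking on a socket read with no SO_TIMEOUT, so the calling thread on Node B waits indefinitely for a node that will never answer. The sketch below is a generic illustration of how such calls can be bounded with an RMIClientSocketFactory that applies connect and read timeouts. It is not Jira's actual cache-replication wiring; the class name and timeout values are illustrative only.

      import java.io.IOException;
      import java.io.Serializable;
      import java.net.InetSocketAddress;
      import java.net.Socket;
      import java.rmi.server.RMIClientSocketFactory;

      // Hypothetical client socket factory: callers using it fail fast instead of
      // hanging forever in SocketInputStream.socketRead0 when the remote node is stuck.
      public class TimeoutClientSocketFactory implements RMIClientSocketFactory, Serializable {

          private static final int CONNECT_TIMEOUT_MS = 5_000;  // give up quickly if the node is unreachable
          private static final int READ_TIMEOUT_MS    = 30_000; // bound the wait on a stalled node

          @Override
          public Socket createSocket(String host, int port) throws IOException {
              Socket socket = new Socket();
              socket.connect(new InetSocketAddress(host, port), CONNECT_TIMEOUT_MS);
              // Without SO_TIMEOUT, reads block indefinitely -- exactly what the
              // "RMI TCP Connection" threads in the dump above are doing.
              socket.setSoTimeout(READ_TIMEOUT_MS);
              return socket;
          }
      }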
      

      Expected 

      When a node goes into an OOME state or another heavy GC loop (and hence stops processing requests), it should be evicted from the cluster so that the other nodes are not affected.

      Actual

      Node B can hang while its threads wait on cache replication calls to the unresponsive node.

      Workaround

      Restart or shut down the node that is hitting OOME.
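
      To apply this workaround automatically, the JVM on each node can be configured to exit as soon as an OutOfMemoryError is thrown, so the failing node drops out of the cluster instead of lingering in a GC loop. The snippet below is an example only: the setenv.sh location and the JVM_SUPPORT_RECOMMENDED_ARGS variable are assumptions that may vary by Jira version, and -XX:+ExitOnOutOfMemoryError requires JDK 8u92 or later.

      # Example only: append to the Jira node's JVM arguments (e.g. in bin/setenv.sh)
      JVM_SUPPORT_RECOMMENDED_ARGS="${JVM_SUPPORT_RECOMMENDED_ARGS} -XX:+ExitOnOutOfMemoryError"
      # On older JDKs, a similar effect can be achieved with:
      #   -XX:OnOutOfMemoryError="kill -9 %p"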

       


            People

              Assignee: Unassigned
              Reporter: vkharisma (Inactive)
              Votes: 39
              Watchers: 66
