- Type: Bug
- Resolution: Not a bug
- Priority: Low
- Affects Version/s: 7.13.6, 8.13.6
- Component/s: Data Center - Node replication
- Severity: 3 - Minor
Issue Summary
The Cluster Cache Replication health check fails and the nodes cannot communicate with each other to replicate the cache:
Name: Cluster Cache Replication NodeId: null Is healthy: false Failure reason: ["The node node3 is not replicating","The node node2 is not replicating"] Severity: CRITICAL
However, the exception in atlassian-jira.log is generic and provides few details about the cause:
LocalQCacheOp{cacheName='com.atlassian.jira.plugins.healthcheck.service.HeartBeatService.heartbeat', action=PUT, key=node2, value == null ? false, replicatePutsViaCopy=true, creationTimeInMillis=1622831185825} from cache replication queue: [queueId=queue_node1_2_164546f60261c7e4be0c5f5f9aaeec86_put, queuePath=/var/atlassian/application-data/jira-home/localq/queue_node1_2_164546f60261c7e4be0c5f5f9aaeec86_put], failuresCount: 1/1. Removing from queue. Error: java.rmi.MarshalException: error marshalling arguments; nested exception is:
java.net.SocketException: Broken pipe (Write failed)
com.atlassian.jira.cluster.distribution.localq.LocalQCacheOpSender$UnrecoverableFailure: java.rmi.MarshalException: error marshalling arguments; nested exception is:
java.net.SocketException: Broken pipe (Write failed)
at com.atlassian.jira.cluster.distribution.localq.rmi.LocalQCacheOpRMISender.send(LocalQCacheOpRMISender.java:90)
Steps to Reproduce
- Set up a Jira Data Center cluster with two or more nodes
- Add two entries to the /etc/hosts file that map the same hostname to both an external and an internal IP, for example:
172.20.40.245 node01
127.0.0.1 node01
The root cause of this behavior is a duplicate entry in the /etc/hosts file that maps the hostname to both an external and an internal IP, causing a loop.
The problem can be explained as follows (see the sketch after this list):
- Each node uses its own hostname to communicate with the other nodes
- When the hosts file contains duplicated entries, the hostname resolves to the loopback IP 127.0.1.1
- Because of this, the master node sees 127.0.1.1 trying to communicate with it and identifies that IP as itself rather than as the secondary node
- The same happens on the secondary node
- The nodes are therefore unable to communicate with each other and end up in a loop
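The resolution step above can be illustrated with a small diagnostic. This is a minimal sketch and not Jira's replication code; it only assumes the standard java.net.InetAddress API (the same hostname resolution RMI relies on) and the hypothetical class name HostResolutionCheck:

import java.net.InetAddress;
import java.net.UnknownHostException;

// Minimal diagnostic sketch (not Jira's code): prints which address the
// local hostname resolves to, and every address it maps to in /etc/hosts
// (or DNS).
public class HostResolutionCheck {
    public static void main(String[] args) throws UnknownHostException {
        // Resolve this machine's own hostname
        InetAddress self = InetAddress.getLocalHost();
        System.out.println("Hostname:    " + self.getHostName());
        System.out.println("Resolves to: " + self.getHostAddress());

        // List all addresses the hostname maps to; on an affected node
        // this includes a loopback entry such as 127.0.1.1
        for (InetAddress addr : InetAddress.getAllByName(self.getHostName())) {
            System.out.println("Mapping:     " + addr.getHostAddress());
        }
    }
}

On an affected node, the output shows the hostname resolving to a loopback address rather than the external IP, which is why a peer mistakes the incoming connection for itself.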
Expected Results
The error message should provide more details about the problem, e.g. some indication of the host communication failure or of the loop.
Actual Results
The error messages are generic; looking at the cache replication errors alone gives no indication that the hostname maps to both a localhost address and an external address.
Workaround
Check the /etc/hosts file for duplicated entries, for example:
127.0.1.1 ip-node-1
127.0.0.1 ip-node-1
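If such duplicates are present, keep the hostname mapped only to the address the other nodes should use. A corrected file might look like the following (the hostname and external IP here are illustrative, not from the original report):
127.0.0.1 localhost
172.20.40.245 ip-node-1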
You may also check:
JIRA - Only One Node Will Start in Cluster