Cluster Cache replication health check fails with java.net.SocketException: Broken pipe


    • 7.13
    • Severity 3 - Minor

      Issue Summary

      Cluster Cache replication health check fails and the nodes cannot communicate with each other to replicate cache.

      Name: Cluster Cache Replication
      NodeId: null
      Is healthy: false
      Failure reason: ["The node node3 is not replicating","The node node2 is not replicating"]
      Severity: CRITICAL
      

      However, the exception in the atlassian-jira.log is generic and gives few details about the cause:

      LocalQCacheOp{cacheName='com.atlassian.jira.plugins.healthcheck.service.HeartBeatService.heartbeat', action=PUT, key=node2, value == null ? false, replicatePutsViaCopy=true, creationTimeInMillis=1622831185825} from cache replication queue: [queueId=queue_node1_2_164546f60261c7e4be0c5f5f9aaeec86_put, queuePath=/var/atlassian/application-data/jira-home/localq/queue_node1_2_164546f60261c7e4be0c5f5f9aaeec86_put], failuresCount: 1/1. Removing from queue. Error: java.rmi.MarshalException: error marshalling arguments; nested exception is: 
          	java.net.SocketException: Broken pipe (Write failed)
      com.atlassian.jira.cluster.distribution.localq.LocalQCacheOpSender$UnrecoverableFailure: java.rmi.MarshalException: error marshalling arguments; nested exception is: 
      	java.net.SocketException: Broken pipe (Write failed)
      	at com.atlassian.jira.cluster.distribution.localq.rmi.LocalQCacheOpRMISender.send(LocalQCacheOpRMISender.java:90)

      Steps to Reproduce

      1. Set up a Jira Data Center cluster with 2 or more nodes
      2. Add two entries to the /etc/hosts file mapping the same hostname to both an external and an internal IP, for example:
        172.20.40.245 node01
        127.0.0.1 node01
        

        The reason for this behavior is a duplicate entry in the /etc/hosts file that maps the same hostname to both an external and an internal IP, causing a communication loop.
        The problem can be explained as follows:

      • Each node uses its own hostname to advertise itself to the other nodes
      • When the hosts file contains duplicated entries, the hostname is resolved to the loopback IP 127.0.1.1
      • As a result, the master node sees 127.0.1.1 trying to communicate with it and recognizes that IP as itself rather than as the secondary node
      • The same happens on the secondary node
      • The nodes are therefore unable to communicate with each other and end up in a loop
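The faulty resolution above can be demonstrated with a small standalone check. This is an illustrative sketch, not part of Jira; the class name is invented, and the default hostname `node01` comes from the /etc/hosts example in the steps above:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Illustrative diagnostic sketch (not an Atlassian tool): checks whether a
// given node hostname resolves to a loopback address, which is the condition
// that triggers the replication loop described above.
public class NodeHostnameCheck {

    // Returns true when the hostname resolves to a loopback IP (127.x.x.x).
    static boolean resolvesToLoopback(String hostname) throws UnknownHostException {
        return InetAddress.getByName(hostname).isLoopbackAddress();
    }

    public static void main(String[] args) throws UnknownHostException {
        // "node01" matches the example hosts-file mapping from the steps above.
        String hostname = args.length > 0 ? args[0] : "node01";
        InetAddress resolved = InetAddress.getByName(hostname);
        System.out.println(hostname + " -> " + resolved.getHostAddress());
        if (resolved.isLoopbackAddress()) {
            System.out.println("WARNING: hostname resolves to a loopback address;"
                + " peer nodes will treat replication traffic from it as their own.");
        }
    }
}
```

Run on an affected node, this prints the loopback warning because the duplicated /etc/hosts entry wins the lookup.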

      Expected Results

      The error message should provide more detail about the problem, i.e. some indication of the host communication failure or resolution loop.

      Actual Results

      The error messages are generic; the cache replication errors alone give no indication that the hostname resolves to both a localhost address and an external address.

      Workaround

      Check /etc/hosts file for duplicated entries, for example:

      127.0.1.1 ip-node-1
      127.0.0.1 ip-node-1
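The duplicate-entry check can also be scripted. The following standalone sketch (class name and behaviour are illustrative assumptions, not an Atlassian utility) lists hostnames that the hosts file maps to more than one IP:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.*;

// Illustrative sketch of the workaround check: reports hostnames that appear
// in /etc/hosts under more than one IP address (e.g. both a loopback and an
// external IP), which is the duplication described in this issue.
public class HostsFileCheck {

    // Parses hosts-file lines and returns only hostnames mapped to 2+ IPs.
    static Map<String, Set<String>> duplicateHostnames(List<String> lines) {
        Map<String, Set<String>> ipsByHost = new HashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) continue;
            String[] fields = trimmed.split("\\s+");
            // fields[0] is the IP; the remaining fields are hostnames/aliases.
            for (int i = 1; i < fields.length; i++) {
                ipsByHost.computeIfAbsent(fields[i], h -> new TreeSet<>()).add(fields[0]);
            }
        }
        ipsByHost.values().removeIf(ips -> ips.size() < 2);
        return ipsByHost;
    }

    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "/etc/hosts";
        duplicateHostnames(Files.readAllLines(Paths.get(path))).forEach((host, ips) ->
            System.out.println(host + " is mapped to multiple IPs: " + ips));
    }
}
```

On a node with the duplicated entries shown above, this would flag ip-node-1 as mapped to both 127.0.1.1 and 127.0.0.1.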
      

      You may also check:
      JIRA - Only One Node Will Start in Cluster

            Assignee:
            Benjamin Suess
            Reporter:
            Victoria M
            Votes:
            2
            Watchers:
            9
