  1. Jira Data Center
  2. JRASERVER-66369

ehcache.listener.socketTimeoutMillis is not used for TCP/RMI handshakes


Details

    Description

      Summary

      In Jira Data Center, during cache replication, the value of ehcache.listener.socketTimeoutMillis from cluster.properties (or its default value) should be used as the read timeout for remote RMI calls to other nodes in the cluster. Instead, an infinite timeout is used, so communication problems with a single node can bring the entire cluster down.
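For orientation, the setting in question is a plain key/value entry in the node's cluster properties file. The fragment below is only a sketch: the value 5000 is an illustrative assumption, not a recommended or documented default.

```properties
# Illustrative value only (assumption): socket timeout in milliseconds
# that cache replication is expected to honour for calls to other nodes.
ehcache.listener.socketTimeoutMillis = 5000
```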

      Environment

      • Jira Data Center with multiple nodes
        Node A is unresponsive because of extremely high load, high memory pressure, or some other condition. In this state the node is not technically down: it is still registered as an 'Active' member of the cluster, but it is not processing requests.
        Node B still considers Node A 'Active', so it keeps sending cache synchronisation requests to Node A. Node A does not respond to them, which leaves Node B stalled.
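The scenario above can be reproduced in miniature with plain sockets. The following is a minimal sketch, not Jira or Ehcache code: the class name UnresponsiveNodeDemo and the method readTimesOut are hypothetical, and a local ServerSocket that never calls accept() stands in for the unresponsive Node A. With a finite read timeout the caller gets a SocketTimeoutException; with a timeout of 0 (infinite) the same read would block indefinitely, which is the behaviour this issue describes.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class: simulates "Node B" reading from a "Node A"
// that accepts TCP connections at the OS level but never answers.
final class UnresponsiveNodeDemo {

    /** Returns true if the read gave up with a timeout instead of returning data. */
    static boolean readTimesOut(int port, int readTimeoutMillis) throws IOException {
        try (Socket toNodeA = new Socket()) {
            // Bounded connect: fails if no connection is established in time.
            toNodeA.connect(new InetSocketAddress("127.0.0.1", port), readTimeoutMillis);
            // Bounded read: a value of 0 here would mean "wait forever".
            toNodeA.setSoTimeout(readTimeoutMillis);
            try {
                toNodeA.getInputStream().read();
                return false; // the peer answered or closed the connection
            } catch (SocketTimeoutException e) {
                return true;  // the peer stayed silent past the timeout
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // The listening socket queues the connection but nothing ever replies.
        try (ServerSocket nodeA = new ServerSocket(0)) {
            System.out.println("timed out: " + readTimesOut(nodeA.getLocalPort(), 500));
        }
    }
}
```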

      Symptoms

       

      Expected behaviour

      TCP and RMI handshakes will throw an exception after the specified timeout has passed.

      Workarounds

      Restart or gracefully shut down the unresponsive node.

      Note on fix

      The fix makes Ehcache replication use finite timeouts (5s by default) for TCP and RMI handshakes.
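One general way to give RMI calls finite connect and read timeouts is a custom RMIClientSocketFactory. The sketch below illustrates that technique only; it is not taken from the Jira or Ehcache source, and the class name TimeoutClientSocketFactory and its single timeoutMillis parameter are assumptions made for the example.

```java
import java.io.IOException;
import java.io.Serializable;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.rmi.server.RMIClientSocketFactory;

// Hypothetical factory: every socket it creates has a bounded connect
// and a bounded read, so a silent peer cannot block the caller forever.
final class TimeoutClientSocketFactory implements RMIClientSocketFactory, Serializable {
    private static final long serialVersionUID = 1L;

    private final int timeoutMillis;

    TimeoutClientSocketFactory(int timeoutMillis) {
        if (timeoutMillis <= 0) {
            // 0 means "infinite" for sockets, which is what we want to avoid.
            throw new IllegalArgumentException("timeoutMillis must be positive");
        }
        this.timeoutMillis = timeoutMillis;
    }

    @Override
    public Socket createSocket(String host, int port) throws IOException {
        Socket socket = new Socket();
        try {
            socket.setSoTimeout(timeoutMillis);                               // read timeout
            socket.connect(new InetSocketAddress(host, port), timeoutMillis); // connect timeout
            return socket;
        } catch (IOException e) {
            socket.close(); // do not leak the socket when the connect fails
            throw e;
        }
    }
}
```

Such a factory could be passed to a standard RMI entry point, for example LocateRegistry.getRegistry(host, port, new TimeoutClientSocketFactory(5000)). Whether the actual fix is wired this way is not stated in this issue.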

            People

              Assignee: Unassigned
              Reporter: Karol Lopacinski (klopacinski)
              Votes: 0
              Watchers: 3
