CONFSERVER-57906 (Confluence Data Center)

Cluster Monitoring panel shows that nodes are unable to reach each other due to mismatched "serialVersionUID" values in the "RemoteModuleCallable" class

      Issue Summary

      In a Data Center environment, an administrator may find that under Confluence Admin > Clustering, the user interface shows that nodes are unable to reach one another, with errors like:

      The node [xxxxxx] is temporarily not reachable. Please check the server logs.

      The actual cluster is up and running despite what the UI suggests. However, the message is alarming to administrators and should be corrected.

      Environment

      This issue was first observed in a two-node Confluence Data Center 6.6.3 cluster using the AWS cluster join method. Both nodes were running Java 8 update 162:

      <java.runtime.version>1.8.0_162-b12</java.runtime.version>
      

      Steps to Reproduce

      Unknown. The issue may be intermittent, because when a class does not declare serialVersionUID explicitly, the JVM computes one automatically at run time.
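
      Although the trigger is unknown, the underlying mechanism is standard Java serialization: when a `Serializable` class declares no `serialVersionUID`, the JVM derives one from the class's structure (name, interfaces, fields, methods). A minimal sketch of inspecting that derived value with the standard `ObjectStreamClass` API (the `Probe` class here is hypothetical, for illustration only):

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

public class SerialUidProbe {
    // Hypothetical class for illustration: it declares no serialVersionUID,
    // so the JVM derives one from the class's name, fields, and methods.
    static class Probe implements Serializable {
        int value;
    }

    // Returns the UID the serialization machinery will actually use.
    static long computedUid() {
        return ObjectStreamClass.lookup(Probe.class).getSerialVersionUID();
    }

    public static void main(String[] args) {
        // Any change to Probe's structure (e.g. bytecode instrumentation by a
        // javaagent) changes this value, and two nodes that compute different
        // values cannot deserialize each other's objects.
        System.out.println("computed serialVersionUID = " + computedUid());
    }
}
```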

      Expected Results

      Cluster Monitoring UI shows that nodes are able to communicate with each other.

      Actual Results

      Cluster Monitoring UI shows that nodes cannot reach one another, even though the cluster itself is up and running.

      The logs show the following corresponding warnings:

      2019-02-13 09:22:37,800 WARN [ajp-nio-127.0.0.1-8009-exec-194] [cluster.hazelcast.monitoring.HazelcastClusterMonitoring] getData Exception happened when receiving response from node 438b4c58
       -- referer: https://example.confluence.com:9443/plugins/servlet/cluster-monitoring | url: /rest/atlassian-cluster-monitoring/cluster/suppliers/data/com.atlassian.cluster.monitoring.cluster-monitoring-plugin/runtime-information/438b4c58 | traceId: 9c5a10920fd04be5 | userName: admin
      java.util.concurrent.ExecutionException: com.hazelcast.nio.serialization.HazelcastSerializationException: java.io.InvalidClassException: com.atlassian.confluence.cluster.hazelcast.monitoring.RemoteModuleCallable; local class incompatible: stream classdesc serialVersionUID = 597473803974431210, local class serialVersionUID = 2184010817253012516
      	at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveApplicationResponseOrThrowException(InvocationFuture.java:357)
      	at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:225)
      	at com.hazelcast.util.executor.DelegatingFuture.get(DelegatingFuture.java:71)
      	at com.atlassian.confluence.cluster.hazelcast.monitoring.HazelcastClusterMonitoring.getData(HazelcastClusterMonitoring.java:79)
      ...
      Caused by: com.hazelcast.nio.serialization.HazelcastSerializationException: java.io.InvalidClassException: com.atlassian.confluence.cluster.hazelcast.monitoring.RemoteModuleCallable; local class incompatible: stream classdesc serialVersionUID = 597473803974431210, local class serialVersionUID = 2184010817253012516
      ...
      Caused by: java.io.InvalidClassException: com.atlassian.confluence.cluster.hazelcast.monitoring.RemoteModuleCallable; local class incompatible: stream classdesc serialVersionUID = 597473803974431210, local class serialVersionUID = 2184010817253012516
      	at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:687)
      	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1876)
      	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1745)
      	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2033)
      	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
      	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:427)
      	at com.hazelcast.nio.serialization.DefaultSerializers$ObjectSerializer.read(DefaultSerializers.java:201)
      	at com.hazelcast.nio.serialization.StreamSerializerAdapter.read(StreamSerializerAdapter.java:41)
      	at com.hazelcast.nio.serialization.SerializationServiceImpl.toObject(SerializationServiceImpl.java:276)
      ...
      
      2019-02-13 09:22:37,802 WARN [ajp-nio-127.0.0.1-8009-exec-194] [cluster.monitoring.rest.ClusterMonitoringResource] getDataProviderInformationForNode Error received when querying remote node [438b4c58]: 
       -- referer: https://example.confluence.com:9443/plugins/servlet/cluster-monitoring | url: /rest/atlassian-cluster-monitoring/cluster/suppliers/data/com.atlassian.cluster.monitoring.cluster-monitoring-plugin/runtime-information/438b4c58 | traceId: 9c5a10920fd04be5 | userName: admin
      

      Notes

      Some notes from the development review:

      We should manually set the `serialVersionUID` in the `RemoteModuleCallable` class instead of letting it be auto-generated. Usually the generated ID is the same on both nodes, but because it is computed internally by the JVM from the class's structure, minute differences in the environment (or sheer luck) can cause different nodes to generate different IDs.
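
      The proposed fix can be sketched as follows, assuming a plain `Serializable` `Callable`; this is an illustrative stand-in, not the actual `RemoteModuleCallable` source:

```java
import java.io.Serializable;
import java.util.concurrent.Callable;

// Illustrative sketch only -- not the actual RemoteModuleCallable source.
// Pinning serialVersionUID means every node agrees on the value regardless
// of how the class bytes were loaded or instrumented.
public class PinnedCallable implements Serializable, Callable<String> {
    // Explicit declaration: the JVM no longer computes the UID at run time.
    private static final long serialVersionUID = 1L;

    private final String moduleKey;

    public PinnedCallable(String moduleKey) {
        this.moduleKey = moduleKey;
    }

    @Override
    public String call() {
        return moduleKey;
    }
}
```

      With the field pinned, structural changes to the class (or agent-injected bytecode) no longer affect the stream's class descriptor check during deserialization.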

      Workaround

      Restarting Confluence may help, but because of the random nature of this problem, it is not guaranteed to resolve the issue.


            Pascal Robert added a comment - I confirm that this was our problem. I switched the old node from AppDynamics to the Datadog APM agent, restarted Confluence on that node, and the error in the GUI and the logs is gone.

            Pascal Robert added a comment - Could one of the causes be a change in the javaagent? We are rolling out the Datadog agent on new nodes, while older nodes still have the AppDynamics agent, and we started seeing this error.

            Steven Milisavic added a comment - I just faced this issue; it was resolved with a rolling restart of all nodes. I also dropped the plugin-cache directories on the first node that was restarted.

              Assignee: Unassigned
              Reporter: rchang (Robert Chang)
              Affected customers: 7
              Watchers: 19