-
Bug
-
Resolution: Fixed
-
Highest
-
7.13.7, 7.17.4, 7.18.1, 7.18.3
-
23
-
Severity 2 - Major
-
292
-
The fix for this bug has been released to our Long Term Support release.
The fix for this bug is now available in the latest release of Confluence 7.13 and 7.19
Issue Summary
Synchrony stops working after 5 - 6 days with this error appearing in atlassian-synchrony.log:
2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 2022-07-16 23:14:22,765 WARN [aleph-netty-server-event-pool-1] [io.netty.channel.DefaultChannelPipeline] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] java.io.IOException: Too many open files
2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] at java.base/sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] at java.base/sun.nio.ch.ServerSocketChannelImpl.accept(Unknown Source)
2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] at java.base/sun.nio.ch.ServerSocketChannelImpl.accept(Unknown Source)
2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] at io.netty.util.internal.SocketUtils$5.run(SocketUtils.java:119)
2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] at io.netty.util.internal.SocketUtils$5.run(SocketUtils.java:116)
2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] at java.base/java.security.AccessController.doPrivileged(Native Method)
2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] at io.netty.util.internal.SocketUtils.accept(SocketUtils.java:116)
2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] at io.netty.channel.socket.nio.NioServerSocketChannel.doReadMessages(NioServerSocketChannel.java:147)
2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] at io.netty.channel.nio.AbstractNioMessageChannel$NioMessageUnsafe.read(AbstractNioMessageChannel.java:79)
2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719)
2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581)
2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] at manifold.executor$thread_factory$reify__14792$f__14793.invoke(executor.clj:47)
2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] at clojure.lang.AFn.run(AFn.java:22)
2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] at java.base/java.lang.Thread.run(Unknown Source)
before it ultimately stops working with:
2022-07-17 06:16:15,691 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 2022-07-16 23:16:15,690 INFO [hz._hzInstance_1_test-Synchrony.cached.thread-14] [internal.partition.impl.MigrationManager] [10.8.8.1]:5701 [test-Synchrony] [3.12.11] Shutdown request of Member [10.8.8.1]:5701 - 62a51355-f947-4160-a4a8-8a2eab729530 this is handled
2022-07-17 06:16:16,110 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 2022-07-16 23:16:16,110 INFO [hz._hzInstance_1_test-Synchrony.io.error.shutdown] [com.hazelcast.instance.Node] [10.8.8.1]:5701 [test-Synchrony] [3.12.11] Shutting down multicast service...
2022-07-17 06:16:16,263 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 2022-07-16 23:16:16,263 INFO [hz._hzInstance_1_test-Synchrony.io.error.shutdown] [com.hazelcast.instance.Node] [10.8.8.1]:5701 [test-Synchrony] [3.12.11] Shutting down connection manager...
2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 2022-07-17 01:20:00,343 ERROR [async-dispatch-15] [synchrony.middleware.errors] error processing request {:id "-503laUB57BReLnqU7OwHA", :throwable #error { 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] :cause "State: SHUT_DOWN Operation: class com.hazelcast.map.impl.operation.GetOperation" 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] :via 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [{:type com.hazelcast.core.HazelcastInstanceNotActiveException 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] :message "State: SHUT_DOWN Operation: class com.hazelcast.map.impl.operation.GetOperation" 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] :at [com.hazelcast.spi.impl.operationservice.impl.Invocation engineActive "Invocation.java" 569]}] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] :trace 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [[com.hazelcast.spi.impl.operationservice.impl.Invocation engineActive "Invocation.java" 569] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [com.hazelcast.spi.impl.operationservice.impl.InvocationBuilderImpl invoke "InvocationBuilderImpl.java" 61] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [com.hazelcast.map.impl.proxy.MapProxySupport invokeOperation "MapProxySupport.java" 434] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [com.hazelcast.map.impl.proxy.MapProxyImpl get "MapProxyImpl.java" 120] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [hz_atom.core.HazelcastAtom deref "core.clj" 29] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [clojure.core$deref invokeStatic "core.clj" 2320] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [hz_atom.core$swap_hz_atom invokeStatic "core.clj" 7] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [hz_atom.core.HazelcastAtom swap "core.clj" 38] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [clojure.core$swap_BANG_ invokeStatic "core.clj" 2353] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [synchrony.sync.hub_locking$remove_lock invokeStatic "hub_locking.clj" 96] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [synchrony.http.entity_api$delete_hub_lock_state_handler$fn__62125 invoke "entity_api.clj" 649] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [clojure.lang.AFn run "AFn.java" 22] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [java.util.concurrent.ThreadPoolExecutor runWorker nil -1] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [java.util.concurrent.ThreadPoolExecutor$Worker run nil -1] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [java.lang.Thread run nil -1] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [------ submitted from ------ nil -1] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [com.hazelcast.spi.impl.operationservice.impl.InvocationFuture resolve "InvocationFuture.java" 126] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [com.hazelcast.spi.impl.AbstractInvocationFuture get "AbstractInvocationFuture.java" 155] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [com.hazelcast.map.impl.proxy.MapProxySupport invokeOperation "MapProxySupport.java" 435] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [com.hazelcast.map.impl.proxy.MapProxyImpl get "MapProxyImpl.java" 120] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [hz_atom.core.HazelcastAtom deref "core.clj" 29] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [clojure.core$deref invokeStatic "core.clj" 2320] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [hz_atom.core$swap_hz_atom invokeStatic "core.clj" 7] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [hz_atom.core.HazelcastAtom swap "core.clj" 38] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [clojure.core$swap_BANG_ invokeStatic "core.clj" 2353] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [synchrony.sync.hub_locking$remove_lock invokeStatic "hub_locking.clj" 96] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [synchrony.http.entity_api$delete_hub_lock_state_handler$fn__62125 invoke "entity_api.clj" 649] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [synchrony.http.entity-api (take?) "entity_api.clj" 70] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [synchrony.http.entity_api$wrap_validate_jwt$fn__60279$fn__60351 invoke "entity_api.clj" 61] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [synchrony.middleware.cache-control (take?) "cache_control.clj" 9] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [synchrony.middleware.cache_control$wrap_cache_control$fn__58126$fn__58166 invoke "cache_control.clj" 8] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [synchrony.middleware.json (take?) "json.clj" 82] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [synchrony.middleware.json$wrap_json_response$fn__59295$fn__59327 invoke "json.clj" 81] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [synchrony.middleware.gzip (take?) "gzip.clj" 37] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [synchrony.middleware.gzip$wrap_gzip$fn__58416$fn__58473 invoke "gzip.clj" 35] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [clojure.lang.AFn run "AFn.java" 22] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [java.util.concurrent.ThreadPoolExecutor runWorker nil -1] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [java.util.concurrent.ThreadPoolExecutor$Worker run nil -1] 2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] [java.lang.Thread run nil -1]]}}
This is reproducible on Data Center: Yes
Steps to Reproduce
- Deploy any of these (but not limited to these OS)
- CentOS 7
- Amazon Ubuntu
Could not reproduce on Amazon Linux
- Download and deploy Confluence 7.18.1
- Before starting Confluence, add this into setenv.sh to lower the ulimit to 4096:
ulimit -n 4096
- During the installation
- Apply a Confluence DC license
- Enable Clustering mode, specifically, select multicast
- Fully complete the installation
- On the terminal:
- Run tail -f atlassian-synchrony.log
- Also check netstat -an | grep 5701 | grep CLOSE_WAIT
Expected Results
- No warnings/errors reported in atlassian-synchrony.log
- No accumulation of CLOSE_WAIT on Synchrony Hazlecast port 5701
Actual Results
We can see netstat -an accumulating CLOSE_WAIT on port 5701:
- 12 entries
Tue Jul 19 04:06:38 UTC 2022
$ date && netstat -an | grep CLOSE_WAIT | grep -v ":443" Tue Jul 19 04:06:38 UTC 2022 tcp6 1 0 10.9.31.97:5701 10.9.31.97:60883 CLOSE_WAIT tcp6 1 0 10.9.31.97:5701 10.9.31.97:44061 CLOSE_WAIT tcp6 1 0 10.9.31.97:5701 10.9.31.97:38319 CLOSE_WAIT tcp6 1 0 10.9.31.97:5701 10.9.31.97:51619 CLOSE_WAIT tcp6 1 0 10.9.31.97:5701 10.9.31.97:51787 CLOSE_WAIT tcp6 1 0 10.9.31.97:5701 10.9.31.97:59073 CLOSE_WAIT tcp6 1 0 10.9.31.97:5701 10.9.31.97:54791 CLOSE_WAIT tcp6 1 0 10.9.31.97:5701 10.9.31.97:33085 CLOSE_WAIT tcp6 1 0 10.9.31.97:5701 10.9.31.97:37749 CLOSE_WAIT tcp6 1 0 10.9.31.97:5701 10.9.31.97:36061 CLOSE_WAIT tcp6 1 0 10.9.31.97:5701 10.9.31.97:53377 CLOSE_WAIT tcp6 1 0 10.9.31.97:5701 10.9.31.97:43427 CLOSE_WAIT
- About 18 hours later, we now have 547 entries
Tue Jul 19 22:17:15 UTC 2022
tcp6 1 0 10.9.31.97:5701 10.9.31.97:45537 CLOSE_WAIT tcp6 1 0 10.9.31.97:5701 10.9.31.97:56607 CLOSE_WAIT tcp6 1 0 10.9.31.97:5701 10.9.31.97:46535 CLOSE_WAIT ... .. tcp6 1 0 10.9.31.97:5701 10.9.31.97:44349 CLOSE_WAIT tcp6 1 0 10.9.31.97:5701 10.9.31.97:54343 CLOSE_WAIT tcp6 1 0 10.9.31.97:5701 10.9.31.97:37791 CLOSE_WAIT
$ netstat -an | grep 5701 | grep CLOSE_WAIT | wc -l 547
For the ports in CLOSE_WAIT, we can see this line was logged earlier in either atlassian-confluence.log or atlassian-synchrony.log:
2022-07-19 14:31:04,462 WARN [hz.confluence.cached.thread-4] [hazelcast.interceptor.authenticator.DefaultClusterJoinManager] checkNodeAuthenticationEnabled CONNECT(10.9.31.97:54343 -> 10.9.31.97:5701): Node authentication failed: Cluster authentication failed. Please make sure all members share the same value for 'confluence.cluster.name' and 'confluence.cluster.authentication.secret' in confluence.cfg.xml.
2022-07-19 14:31:04,462 DEBUG [232:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 2022-07-19 14:31:04,462 WARN [hz._hzInstance_1_testcluster-Synchrony.cached.thread-3] [hazelcast.interceptor.authenticator.DefaultClusterJoinManager] ACCEPT(10.9.31.97:5701 <- 10.9.31.97:54343): Node authentication failed: Cluster authentication failed. Please make sure all members share the same value for cluster name and shared secret.
which looks related and possibly caused by:
After about 5 - 6 days, Synchrony process encounters this error and stops working:
2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] java.io.IOException: Too many open files ..
2022-07-17 06:16:15,691 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 2022-07-16 23:16:15,690 INFO [hz._hzInstance_1_test-Synchrony.cached.thread-14] [internal.partition.impl.MigrationManager] [10.8.8.1]:5701 [test-Synchrony] [3.12.11] Shutdown request of Member [10.8.8.1]:5701 - 62a51355-f947-4160-a4a8-8a2eab729530 this is handled
2022-07-17 06:16:16,110 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 2022-07-16 23:16:16,110 INFO [hz._hzInstance_1_test-Synchrony.io.error.shutdown] [com.hazelcast.instance.Node] [10.8.8.1]:5701 [test-Synchrony] [3.12.11] Shutting down multicast service...
2022-07-17 06:16:16,263 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 2022-07-16 23:16:16,263 INFO [hz._hzInstance_1_test-Synchrony.io.error.shutdown] [com.hazelcast.instance.Node] [10.8.8.1]:5701 [test-Synchrony] [3.12.11] Shutting down connection manager...
Diagnosis
Check the number of CLOSE_WAIT ports against the Synchrony hazelcast port (5701 by default):
netstat -an | grep 5701 | grep CLOSE_WAIT | grep -vi grep | wc -l
This count should be 0 but you are affected by this issue if the above count is steadily increasing over several hours/days.
Workaround
Option 1 - Change to TCP/IP discovery
Change Node Discovery from Multicast to TCP/IP or AWS if possible.
Option 2 - Increase open file limit
If multicast is the only option for Hazelcast node discovery:
- Set a higher ulimit in <confluence-install>/bin/setenv.sh as per Health Check: Confluence Open Files Limit
- A value of 48000 will provide stability for Synchrony for approximately 3 months before it will reach the open file limit
- Restart the Confluence on the affected node for the above changes to take effect
This would need to be done for each node in the Confluence DC cluster.
- Setup a schedule to restart individual Confluence nodes before Synchrony reaches the configured ulimit value
Option 3 (quickest workaround) - Change the Synchrony multicast discovery port
- Shutdown the entire Confluence Cluster
- Create/edit the <confluence-shared-home>/synchrony-args.properties file
- Add in this line:
cluster.join.multicast.port=54328
- This will configure Synchrony hazelcast multicast discovery to work over port 54328 (Confluence hazelcast multicast will still work over the default port 54327)
- This will now stop Confluence and Synchrony hazelcast discovery conflicting with each other and stop the accumulating CLOSE_WAIT ports
- Add in this line:
- Start Confluence (one node at a time)
- relates to
-
CONFSERVER-79477 Confluence and Synchrony logs "Cluster authentication failed. Please make sure all members share the same value" with Hazelcast multicast discovery
-
- Closed
-
- mentioned in
-
Page Failed to load
-
Page Failed to load
-
Page Failed to load
-
Page Failed to load
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
A fix for this issue is available in Confluence Server and Data Center 8.1.0.
Upgrade now or check out the Release Notes to see what other issues are resolved.