Confluence Data Center
CONFSERVER-79547

Synchrony stops working with "java.io.IOException: Too many open files" on Linux when specifically using multicast Hazelcast discovery


Details

    Description

      The fix for this bug has been released to our Long Term Support releases.

      The fix for this bug is now available in the latest releases of Confluence 7.13 and 7.19.

      Issue Summary

      Synchrony stops working after 5-6 days, with this error appearing in atlassian-synchrony.log:

      2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 2022-07-16 23:14:22,765 WARN [aleph-netty-server-event-pool-1] [io.netty.channel.DefaultChannelPipeline] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
      2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] java.io.IOException: Too many open files
      2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	at java.base/sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
      2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	at java.base/sun.nio.ch.ServerSocketChannelImpl.accept(Unknown Source)
      2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	at java.base/sun.nio.ch.ServerSocketChannelImpl.accept(Unknown Source)
      2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	at io.netty.util.internal.SocketUtils$5.run(SocketUtils.java:119)
      2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	at io.netty.util.internal.SocketUtils$5.run(SocketUtils.java:116)
      2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	at java.base/java.security.AccessController.doPrivileged(Native Method)
      2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	at io.netty.util.internal.SocketUtils.accept(SocketUtils.java:116)
      2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	at io.netty.channel.socket.nio.NioServerSocketChannel.doReadMessages(NioServerSocketChannel.java:147)
      2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	at io.netty.channel.nio.AbstractNioMessageChannel$NioMessageUnsafe.read(AbstractNioMessageChannel.java:79)
      2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719)
      2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
      2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581)
      2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
      2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
      2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
      2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	at manifold.executor$thread_factory$reify__14792$f__14793.invoke(executor.clj:47)
      2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	at clojure.lang.AFn.run(AFn.java:22)
      2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	at java.base/java.lang.Thread.run(Unknown Source)
      

      before it ultimately stops working with:

      2022-07-17 06:16:15,691 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 2022-07-16 23:16:15,690 INFO [hz._hzInstance_1_test-Synchrony.cached.thread-14] [internal.partition.impl.MigrationManager] [10.8.8.1]:5701 [test-Synchrony] [3.12.11] Shutdown request of Member [10.8.8.1]:5701 - 62a51355-f947-4160-a4a8-8a2eab729530 this is handled
      2022-07-17 06:16:16,110 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 2022-07-16 23:16:16,110 INFO [hz._hzInstance_1_test-Synchrony.io.error.shutdown] [com.hazelcast.instance.Node] [10.8.8.1]:5701 [test-Synchrony] [3.12.11] Shutting down multicast service...
      2022-07-17 06:16:16,263 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 2022-07-16 23:16:16,263 INFO [hz._hzInstance_1_test-Synchrony.io.error.shutdown] [com.hazelcast.instance.Node] [10.8.8.1]:5701 [test-Synchrony] [3.12.11] Shutting down connection manager...
      
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 2022-07-17 01:20:00,343 ERROR [async-dispatch-15] [synchrony.middleware.errors] error processing request {:id "-503laUB57BReLnqU7OwHA", :throwable #error {
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	 :cause "State: SHUT_DOWN Operation: class com.hazelcast.map.impl.operation.GetOperation"
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	 :via
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	 [{:type com.hazelcast.core.HazelcastInstanceNotActiveException
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	   :message "State: SHUT_DOWN Operation: class com.hazelcast.map.impl.operation.GetOperation"
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	   :at [com.hazelcast.spi.impl.operationservice.impl.Invocation engineActive "Invocation.java" 569]}]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	 :trace
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	 [[com.hazelcast.spi.impl.operationservice.impl.Invocation engineActive "Invocation.java" 569]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [com.hazelcast.spi.impl.operationservice.impl.InvocationBuilderImpl invoke "InvocationBuilderImpl.java" 61]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [com.hazelcast.map.impl.proxy.MapProxySupport invokeOperation "MapProxySupport.java" 434]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [com.hazelcast.map.impl.proxy.MapProxyImpl get "MapProxyImpl.java" 120]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [hz_atom.core.HazelcastAtom deref "core.clj" 29]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [clojure.core$deref invokeStatic "core.clj" 2320]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [hz_atom.core$swap_hz_atom invokeStatic "core.clj" 7]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [hz_atom.core.HazelcastAtom swap "core.clj" 38]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [clojure.core$swap_BANG_ invokeStatic "core.clj" 2353]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [synchrony.sync.hub_locking$remove_lock invokeStatic "hub_locking.clj" 96]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [synchrony.http.entity_api$delete_hub_lock_state_handler$fn__62125 invoke "entity_api.clj" 649]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [clojure.lang.AFn run "AFn.java" 22]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [java.util.concurrent.ThreadPoolExecutor runWorker nil -1]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [java.util.concurrent.ThreadPoolExecutor$Worker run nil -1]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [java.lang.Thread run nil -1]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [------ submitted from ------  nil -1]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [com.hazelcast.spi.impl.operationservice.impl.InvocationFuture resolve "InvocationFuture.java" 126]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [com.hazelcast.spi.impl.AbstractInvocationFuture get "AbstractInvocationFuture.java" 155]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [com.hazelcast.map.impl.proxy.MapProxySupport invokeOperation "MapProxySupport.java" 435]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [com.hazelcast.map.impl.proxy.MapProxyImpl get "MapProxyImpl.java" 120]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [hz_atom.core.HazelcastAtom deref "core.clj" 29]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [clojure.core$deref invokeStatic "core.clj" 2320]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [hz_atom.core$swap_hz_atom invokeStatic "core.clj" 7]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [hz_atom.core.HazelcastAtom swap "core.clj" 38]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [clojure.core$swap_BANG_ invokeStatic "core.clj" 2353]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [synchrony.sync.hub_locking$remove_lock invokeStatic "hub_locking.clj" 96]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [synchrony.http.entity_api$delete_hub_lock_state_handler$fn__62125 invoke "entity_api.clj" 649]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [synchrony.http.entity-api (take?) "entity_api.clj" 70]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [synchrony.http.entity_api$wrap_validate_jwt$fn__60279$fn__60351 invoke "entity_api.clj" 61]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [synchrony.middleware.cache-control (take?) "cache_control.clj" 9]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [synchrony.middleware.cache_control$wrap_cache_control$fn__58126$fn__58166 invoke "cache_control.clj" 8]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [synchrony.middleware.json (take?) "json.clj" 82]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [synchrony.middleware.json$wrap_json_response$fn__59295$fn__59327 invoke "json.clj" 81]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [synchrony.middleware.gzip (take?) "gzip.clj" 37]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [synchrony.middleware.gzip$wrap_gzip$fn__58416$fn__58473 invoke "gzip.clj" 35]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [clojure.lang.AFn run "AFn.java" 22]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [java.util.concurrent.ThreadPoolExecutor runWorker nil -1]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [java.util.concurrent.ThreadPoolExecutor$Worker run nil -1]
      2022-07-17 08:20:00,343 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 	  [java.lang.Thread run nil -1]]}}
      

      This is reproducible on Data Center: Yes

      Steps to Reproduce

      1. Deploy one of the following operating systems (the issue is not limited to these):
        • CentOS 7
        • Ubuntu (on Amazon EC2)
          • Could not be reproduced on Amazon Linux
      2. Download and deploy Confluence 7.18.1
      3. Before starting Confluence, add this line to setenv.sh to lower the open file limit to 4096:
        ulimit -n 4096
        
      4. During the installation
        • Apply a Confluence DC license
        • Enable clustering mode and specifically select multicast discovery
      5. Fully complete the installation
      6. On the terminal:
        • Run tail -f atlassian-synchrony.log
        • Also check netstat -an | grep 5701 | grep CLOSE_WAIT
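      The two checks in step 6 can be combined into a small helper. The sketch below is purely illustrative (the `count_close_wait` function is ours, not part of Confluence); it counts CLOSE_WAIT sockets on the Hazelcast port from `netstat -an`-style output:

```shell
#!/bin/sh
# Illustrative helper (not an Atlassian tool): count CLOSE_WAIT sockets
# on a given port, reading `netstat -an`-style output from stdin.
count_close_wait() {
  port=${1:-5701}                       # Synchrony Hazelcast port, 5701 by default
  grep ":${port}" | grep -c 'CLOSE_WAIT'
}

# Example against a captured netstat sample; on a live node you would
# instead run: netstat -an | count_close_wait 5701
sample='tcp6       1      0 10.9.31.97:5701   10.9.31.97:60883   CLOSE_WAIT
tcp6       0      0 10.9.31.97:5701   10.9.31.97:44061   ESTABLISHED
tcp6       1      0 10.9.31.97:5701   10.9.31.97:38319   CLOSE_WAIT'
printf '%s\n' "$sample" | count_close_wait 5701   # prints 2
```

      On a healthy node this stays at 0; rerunning it periodically during the reproduction shows the count climbing.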

      Expected Results

      1. No warnings/errors reported in atlassian-synchrony.log
      2. No accumulation of CLOSE_WAIT connections on Synchrony Hazelcast port 5701

      Actual Results

      We can see netstat -an output accumulating CLOSE_WAIT entries on port 5701:

      1. 12 entries
        $ date && netstat -an | grep CLOSE_WAIT | grep -v ":443"
        Tue Jul 19 04:06:38 UTC 2022
        tcp6       1      0 10.9.31.97:5701         10.9.31.97:60883        CLOSE_WAIT 
        tcp6       1      0 10.9.31.97:5701         10.9.31.97:44061        CLOSE_WAIT 
        tcp6       1      0 10.9.31.97:5701         10.9.31.97:38319        CLOSE_WAIT 
        tcp6       1      0 10.9.31.97:5701         10.9.31.97:51619        CLOSE_WAIT 
        tcp6       1      0 10.9.31.97:5701         10.9.31.97:51787        CLOSE_WAIT 
        tcp6       1      0 10.9.31.97:5701         10.9.31.97:59073        CLOSE_WAIT 
        tcp6       1      0 10.9.31.97:5701         10.9.31.97:54791        CLOSE_WAIT 
        tcp6       1      0 10.9.31.97:5701         10.9.31.97:33085        CLOSE_WAIT 
        tcp6       1      0 10.9.31.97:5701         10.9.31.97:37749        CLOSE_WAIT 
        tcp6       1      0 10.9.31.97:5701         10.9.31.97:36061        CLOSE_WAIT 
        tcp6       1      0 10.9.31.97:5701         10.9.31.97:53377        CLOSE_WAIT 
        tcp6       1      0 10.9.31.97:5701         10.9.31.97:43427        CLOSE_WAIT
        
      2. About 18 hours later, we now have 547 entries
        Tue Jul 19 22:17:15 UTC 2022
        tcp6       1      0 10.9.31.97:5701         10.9.31.97:45537        CLOSE_WAIT 
        tcp6       1      0 10.9.31.97:5701         10.9.31.97:56607        CLOSE_WAIT 
        tcp6       1      0 10.9.31.97:5701         10.9.31.97:46535        CLOSE_WAIT 
        ...
        ..
        tcp6       1      0 10.9.31.97:5701         10.9.31.97:44349        CLOSE_WAIT 
        tcp6       1      0 10.9.31.97:5701         10.9.31.97:54343        CLOSE_WAIT 
        tcp6       1      0 10.9.31.97:5701         10.9.31.97:37791        CLOSE_WAIT 
        
        $ netstat -an | grep 5701 | grep CLOSE_WAIT | wc -l
        547
        

      For each port stuck in CLOSE_WAIT, a corresponding line was logged earlier in either atlassian-confluence.log or atlassian-synchrony.log:

      atlassian-confluence.log
      2022-07-19 14:31:04,462 WARN [hz.confluence.cached.thread-4] [hazelcast.interceptor.authenticator.DefaultClusterJoinManager] checkNodeAuthenticationEnabled CONNECT(10.9.31.97:54343 -> 10.9.31.97:5701): Node authentication failed: Cluster authentication failed. Please make sure all members share the same value for 'confluence.cluster.name' and 'confluence.cluster.authentication.secret' in confluence.cfg.xml.
      
      atlassian-synchrony.log
      2022-07-19 14:31:04,462 DEBUG [232:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 2022-07-19 14:31:04,462 WARN [hz._hzInstance_1_testcluster-Synchrony.cached.thread-3] [hazelcast.interceptor.authenticator.DefaultClusterJoinManager] ACCEPT(10.9.31.97:5701 <- 10.9.31.97:54343): Node authentication failed: Cluster authentication failed. Please make sure all members share the same value for cluster name and shared secret. 
      

      These authentication failures look related to, and are possibly the cause of, the leaked connections.

      After about 5-6 days, the Synchrony process encounters this error and stops working:

      2022-07-17 06:14:23,069 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] java.io.IOException: Too many open files
      ..
      
      2022-07-17 06:16:15,691 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 2022-07-16 23:16:15,690 INFO [hz._hzInstance_1_test-Synchrony.cached.thread-14] [internal.partition.impl.MigrationManager] [10.8.8.1]:5701 [test-Synchrony] [3.12.11] Shutdown request of Member [10.8.8.1]:5701 - 62a51355-f947-4160-a4a8-8a2eab729530 this is handled
      2022-07-17 06:16:16,110 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 2022-07-16 23:16:16,110 INFO [hz._hzInstance_1_test-Synchrony.io.error.shutdown] [com.hazelcast.instance.Node] [10.8.8.1]:5701 [test-Synchrony] [3.12.11] Shutting down multicast service...
      2022-07-17 06:16:16,263 DEBUG [303:StdOutHandler [/opt/atlassian/confluence/jre/bin/java]] 2022-07-16 23:16:16,263 INFO [hz._hzInstance_1_test-Synchrony.io.error.shutdown] [com.hazelcast.instance.Node] [10.8.8.1]:5701 [test-Synchrony] [3.12.11] Shutting down connection manager...
      

      Diagnosis

      Check the number of CLOSE_WAIT connections against the Synchrony Hazelcast port (5701 by default):

      netstat -an | grep 5701 | grep CLOSE_WAIT | grep -vi grep | wc -l
      

      This count should be 0; you are affected by this issue if it steadily increases over several hours or days.
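      The "steadily increasing" check can be expressed mechanically: sample the count a few times, hours apart, and verify the series never decreases. The `is_increasing` helper below is illustrative only, not an Atlassian tool:

```shell
#!/bin/sh
# Illustrative check: given a series of CLOSE_WAIT counts sampled over
# time, succeed if the series never decreases.
is_increasing() {
  prev=-1
  for n in "$@"; do
    [ "$n" -ge "$prev" ] || return 1   # a drop means sockets were reclaimed
    prev=$n
  done
}

# Counts taken from this report: 12 entries, then 547 about 18 hours later.
if is_increasing 12 547; then
  echo "affected: CLOSE_WAIT count on port 5701 is growing"
fi
```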

      Workaround

      Option 1 - Change to TCP/IP discovery

      Change Node Discovery from Multicast to TCP/IP or AWS if possible.

      Option 2 - Increase open file limit

      If multicast is the only option for Hazelcast node discovery:

      1. Set a higher ulimit in <confluence-install>/bin/setenv.sh as per Health Check: Confluence Open Files Limit
        • A value of 48000 will keep Synchrony stable for approximately 3 months before it reaches the open file limit again
        • Restart Confluence on the affected node for the change to take effect
        • This needs to be done on each node in the Confluence DC cluster
      2. Set up a schedule to restart individual Confluence nodes before Synchrony reaches the configured ulimit value
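      As a sketch, the setenv.sh addition could look like the fragment below. The 48000 figure comes from this report; the soft limit can only be raised up to the hard limit (`ulimit -Hn`), so on some systems /etc/security/limits.conf or the systemd unit's LimitNOFILE= must be raised first:

```shell
# Hypothetical addition near the top of <confluence-install>/bin/setenv.sh:
# raise the open file limit for the Confluence/Synchrony JVM processes.
ulimit -n 48000 2>/dev/null || echo "WARNING: could not raise open file limit (check ulimit -Hn)" >&2
```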

      Option 3 (quickest workaround) - Change the Synchrony multicast discovery port

      1. Shut down the entire Confluence cluster
      2. Create/edit the <confluence-shared-home>/synchrony-args.properties file
        • Add in this line:
          cluster.join.multicast.port=54328
          
        • This configures Synchrony Hazelcast multicast discovery to use port 54328 (Confluence Hazelcast multicast will still use the default port 54327)
        • This stops Confluence and Synchrony Hazelcast discovery from conflicting with each other, and stops the accumulation of CLOSE_WAIT connections
      3. Start Confluence (one node at a time)
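      Step 2 amounts to a one-line addition to the shared-home properties file. As a sketch (using a temporary directory as a stand-in for the real <confluence-shared-home> path):

```shell
#!/bin/sh
# Sketch of step 2: append the multicast port override to
# synchrony-args.properties. SHARED_HOME is a stand-in for the real
# <confluence-shared-home> path on your deployment.
SHARED_HOME=$(mktemp -d)

printf 'cluster.join.multicast.port=54328\n' >> "$SHARED_HOME/synchrony-args.properties"

# Confirm the override is in place before restarting the cluster.
grep '^cluster.join.multicast.port=' "$SHARED_HOME/synchrony-args.properties"
```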

People

  bhyde Bradley Hyde
  hlam@atlassian.com Eric Lam