CONFSERVER-35930: Confluence Data Center nodes can get into a deadlocked state

      We had a situation on our production instance confluence.atlassian.com where two cluster nodes got into a deadlocked state. Fortunately, one of the other nodes remained unaffected and was able to continue serving requests, so the deadlock did not bring down the whole cluster.

      Here is the scenario:

      node1

      • All Hazelcast listener threads were attempting to process near cache entry invalidation messages.
      • In order to do this, they need to deserialize the cache key from the event.
      • This, in turn, requires loading the key's class.
      • The ConfluenceUberClassLoader delegates to the PluginsClassLoader, and class loading through it is serialized on a single class-loading lock, so the threads queue up (a sketch follows this list).
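
      A minimal sketch of this first part of the chain, with hypothetical class names standing in for the Hazelcast and Confluence classes in the stack trace further down: deserializing a cache key is itself a class-loading operation, and every listener thread funnels through the same shared, non-parallel-capable class loader to do it.

      import java.io.ByteArrayInputStream;
      import java.io.ByteArrayOutputStream;
      import java.io.IOException;
      import java.io.InputStream;
      import java.io.ObjectInputStream;
      import java.io.ObjectOutputStream;
      import java.io.ObjectStreamClass;
      import java.io.Serializable;

      public class NearCacheKeyDeserialization {

          // Stand-in for a plugin-provided cache key class.
          static class CacheKey implements Serializable {
              final long entityId;
              CacheKey(long entityId) { this.entityId = entityId; }
          }

          // Stand-in for the plugin class loader: it is not parallel-capable,
          // so every loadClass call serializes on the loader instance itself.
          static class PluginAwareClassLoader extends ClassLoader {
              PluginAwareClassLoader(ClassLoader parent) { super(parent); }

              @Override
              protected Class<?> findClass(String name) throws ClassNotFoundException {
                  // The real PluginsClassLoader consults the plugin system here
                  // (isPluginEnabled, PluginStateStore, ...) while this thread
                  // still holds the class-loading lock.
                  throw new ClassNotFoundException(name);
              }
          }

          // Stand-in for Hazelcast's deserializing stream: it resolves classes
          // through the shared loader, so concurrent invalidation events all
          // contend on that loader's lock.
          static class PluginAwareObjectInputStream extends ObjectInputStream {
              private final ClassLoader loader;

              PluginAwareObjectInputStream(InputStream in, ClassLoader loader) throws IOException {
                  super(in);
                  this.loader = loader;
              }

              @Override
              protected Class<?> resolveClass(ObjectStreamClass desc) throws ClassNotFoundException {
                  return Class.forName(desc.getName(), false, loader);
              }
          }

          public static void main(String[] args) throws Exception {
              // Serialize a key, then deserialize it the way an invalidation
              // listener would: via the plugin-aware class loader.
              ByteArrayOutputStream bytes = new ByteArrayOutputStream();
              try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                  out.writeObject(new CacheKey(42));
              }
              ClassLoader loader =
                      new PluginAwareClassLoader(NearCacheKeyDeserialization.class.getClassLoader());
              try (ObjectInputStream in = new PluginAwareObjectInputStream(
                      new ByteArrayInputStream(bytes.toByteArray()), loader)) {
                  CacheKey key = (CacheKey) in.readObject();
                  System.out.println("Deserialized key for entity " + key.entityId);
              }
          }
      }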

      So by this point we have all the listener threads waiting for the class-loading lock. Now:

      • One thread holds the lock and is attempting to load the class of the cache key to invalidate.
      • In order to fetch the correct class loader, it needs to check that the right plugin is enabled.
      • To check whether a plugin is enabled, it queries the PluginStateStore.
      • The CachingPluginStateStore keeps the plugin state in a near cache, so usually this call is fast. However, in this case the near cache entry had been invalidated.
      • So in order to read the state of the plugin system, the plugin state had to be reloaded from the distributed cache.
      • The cache partition that held it was owned by another node, so Hazelcast attempted to read it with a remote operation.

      So by now we have all the listener threads on node1 waiting for the class loader; the class loader is waiting for the PluginStateStore; and the PluginStateStore, whose data lives on another node, is waiting for a remote cache operation.
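
      The critical step is that the remote read happens while the class-loading lock is held. A small, self-contained sketch of that situation (hypothetical names; a CompletableFuture stands in for the blocking remote Hazelcast read, and a timeout is added so the demo terminates rather than hanging):

      import java.util.concurrent.CompletableFuture;
      import java.util.concurrent.ExecutionException;
      import java.util.concurrent.TimeUnit;
      import java.util.concurrent.TimeoutException;

      public class ClassLoadBlockedOnRemoteCache {

          // Stands in for the PluginStateStore data whose partition lives on
          // another node. In the incident the near-cache copy had just been
          // invalidated, so the read had to go remote; here it never completes.
          static final CompletableFuture<String> remotePluginState = new CompletableFuture<>();

          static class PluginAwareClassLoader extends ClassLoader {
              @Override
              protected Class<?> findClass(String name) throws ClassNotFoundException {
                  try {
                      // isPluginEnabled -> PluginStateStore.load() -> remote map.get(),
                      // all while this thread owns the class-loading lock.
                      remotePluginState.get(2, TimeUnit.SECONDS);
                  } catch (TimeoutException e) {
                      System.out.println(Thread.currentThread().getName()
                              + ": remote read never returned; still holding the class-loading lock");
                  } catch (InterruptedException | ExecutionException e) {
                      Thread.currentThread().interrupt();
                  }
                  throw new ClassNotFoundException(name);
              }
          }

          public static void main(String[] args) throws Exception {
              ClassLoader loader = new PluginAwareClassLoader();

              // Listener threads deserializing cache keys through the same loader:
              // the first blocks inside findClass, the others queue on the lock.
              Runnable loadKeyClass = () -> {
                  try {
                      loader.loadClass("com.example.SomeCacheKey"); // hypothetical key class
                  } catch (ClassNotFoundException expected) {
                      System.out.println(Thread.currentThread().getName()
                              + ": class load finished only after the timed-out remote read");
                  }
              };
              Thread t1 = new Thread(loadKeyClass, "async-entry-listener-1");
              Thread t2 = new Thread(loadKeyClass, "async-entry-listener-2");
              t1.start();
              t2.start();
              t1.join();
              t2.join();
          }
      }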

      The stack trace for the thread attempting the operation:

      "async-entry-listener-19" daemon prio=10 tid=0x00007f4098a90800 nid=0x7d26 in Object.wait() [0x00007f3ff39bc000]
         java.lang.Thread.State: TIMED_WAITING (on object monitor)
      	at java.lang.Object.wait(Native Method)
      	at com.hazelcast.spi.impl.BasicInvocationFuture.waitForResponse(BasicInvocationFuture.java:194)
      	- locked <0x000000077d63ea60> (a com.hazelcast.spi.impl.BasicInvocationFuture)
      	at com.hazelcast.spi.impl.BasicInvocationFuture.get(BasicInvocationFuture.java:166)
      	at com.hazelcast.spi.impl.BasicInvocationFuture.get(BasicInvocationFuture.java:146)
      	at com.hazelcast.map.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:337)
      	at com.hazelcast.map.proxy.MapProxySupport.getInternal(MapProxySupport.java:240)
      	at com.hazelcast.map.proxy.MapProxyImpl.get(MapProxyImpl.java:82)
      	at com.atlassian.confluence.cache.hazelcast.nearcache.NearCacheLoader.load(NearCacheLoader.java:25)
      	at com.atlassian.confluence.cache.hazelcast.monitoring.ConfluenceMonitoringLoader.load(ConfluenceMonitoringLoader.java:37)
      	at com.atlassian.cache.ehcache.EhCacheManager$3$1.createEntry(EhCacheManager.java:168)
      	at net.sf.ehcache.constructs.blocking.SelfPopulatingCache.get(SelfPopulatingCache.java:73)
      	at com.atlassian.cache.ehcache.DelegatingCache.get(DelegatingCache.java:74)
      	at com.atlassian.confluence.cache.ehcache.DefaultConfluenceEhCache.get(DefaultConfluenceEhCache.java:36)
      	at com.atlassian.confluence.cache.ConfluenceMonitoringCache.get(ConfluenceMonitoringCache.java:43)
      	at com.atlassian.confluence.cache.hazelcast.nearcache.ConfluenceNearCache.get(ConfluenceNearCache.java:57)
      	at com.atlassian.confluence.cache.ConfluenceMonitoringCache.get(ConfluenceMonitoringCache.java:43)
      	at com.atlassian.confluence.cache.TransactionalCacheFactory$TransactionAwareCache.get(TransactionalCacheFactory.java:335)
      	at com.atlassian.confluence.plugin.CachingPluginStateStore.load(CachingPluginStateStore.java:47)
      	at sun.reflect.GeneratedMethodAccessor62.invoke(Unknown Source)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:307)
      	at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:198)
      	at com.sun.proxy.$Proxy28.load(Unknown Source)
      	at com.atlassian.plugin.manager.DefaultPluginManager.getState(DefaultPluginManager.java:956)
      	at com.atlassian.plugin.manager.DefaultPluginManager.isPluginEnabled(DefaultPluginManager.java:1987)
      	at com.atlassian.plugin.classloader.PluginsClassLoader.isPluginEnabled(PluginsClassLoader.java:211)
      	at com.atlassian.plugin.classloader.PluginsClassLoader.findClass(PluginsClassLoader.java:104)
      	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
      	- locked <0x00000006b2cedc48> (a com.atlassian.plugin.classloader.PluginsClassLoader)
      	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
      	at com.atlassian.confluence.util.ConfluenceUberClassLoader.loadClass(ConfluenceUberClassLoader.java:83)
      	at com.hazelcast.nio.ClassLoaderUtil.tryLoadClass(ClassLoaderUtil.java:124)
      	at com.hazelcast.nio.ClassLoaderUtil.loadClass(ClassLoaderUtil.java:97)
      	at com.hazelcast.nio.IOUtil$1.resolveClass(IOUtil.java:98)
      	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
      	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
      	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
      	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
      	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
      	at com.hazelcast.nio.serialization.DefaultSerializers$ObjectSerializer.read(DefaultSerializers.java:196)
      	at com.hazelcast.nio.serialization.StreamSerializerAdapter.read(StreamSerializerAdapter.java:63)
      	at com.hazelcast.nio.serialization.SerializationServiceImpl.toObject(SerializationServiceImpl.java:261)
      	at com.hazelcast.map.DataAwareEntryEvent.getKey(DataAwareEntryEvent.java:62)
      	at com.atlassian.confluence.cache.hazelcast.nearcache.NearCacheEntryListener.invalidateIfNonLocal(NearCacheEntryListener.java:72)
      	at com.atlassian.confluence.cache.hazelcast.nearcache.NearCacheEntryListener.entryEvicted(NearCacheEntryListener.java:45)
      	at com.atlassian.confluence.cache.hazelcast.DefaultAsyncEntryListenerGenerator$CallDelegateTask.run(DefaultAsyncEntryListenerGenerator.java:187)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:744)
      
         Locked ownable synchronizers:
      	- <0x00000006b3bf55e8> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
      	- <0x00000006c69d7090> (a java.util.concurrent.ThreadPoolExecutor$Worker)
      

      node2
      All of node2's threads were stuck trying to load the same cache value (from a Hibernate second-level cache) from a remote cache partition on node1. Since all of node1's Hazelcast listener threads were used up waiting, node2's threads got stuck too, waiting for a response from node1 that could never come. Deadlock.
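
      This is the classic bounded-pool starvation pattern. A single-JVM analogy (made-up names and pool size, not Confluence code): a fixed pool whose workers all block on work that only that same pool could execute, so a caller waiting on any of their results waits forever. A timeout is used so the demo terminates.

      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;
      import java.util.concurrent.Future;
      import java.util.concurrent.TimeUnit;
      import java.util.concurrent.TimeoutException;

      public class ListenerPoolStarvation {
          public static void main(String[] args) throws Exception {
              // Stands in for node1's fixed-size async-entry-listener pool.
              ExecutorService listenerPool = Executors.newFixedThreadPool(2);

              // Each "invalidation" task blocks waiting on work that is itself
              // submitted to the same pool, like the plugin-state read that needs
              // a free thread on node1 to be served.
              Future<String> first = listenerPool.submit(() -> blockedOnNestedWork(listenerPool));
              Future<String> second = listenerPool.submit(() -> blockedOnNestedWork(listenerPool));

              try {
                  // Stands in for node2 waiting on node1's response: it never comes.
                  System.out.println(first.get(2, TimeUnit.SECONDS));
              } catch (TimeoutException stuck) {
                  System.out.println("No response: every pool thread is blocked on queued work");
              } finally {
                  listenerPool.shutdownNow();
              }
          }

          private static String blockedOnNestedWork(ExecutorService pool) throws Exception {
              // The nested task can never start: both pool threads are already
              // blocked right here, waiting for it.
              Future<String> nested = pool.submit(() -> "plugin state");
              return nested.get();
          }
      }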

      What we can do about it:
      The crux of the problem is what is happening on node1:

      Cache Operation -> load class -> read plugin state -> remote cache operation

      This circular dependency, where one cache operation can itself require another (possibly remote) cache operation, is what leaves us open to deadlocks. It is particularly an issue because the PluginStateStore is needed for pretty much everything.

      oburn had a great idea: make the PluginStateStore NOT a regular cache, but an always-local data store. Have it listen to application events to keep itself up to date asynchronously (we do this anyway with plugin events). Ensure it always has a value, so reading the plugin system state is always reliable and fast. This would allow the threads on node1 to untangle themselves and "bottom out" at a local call rather than a remote one.
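
      A minimal sketch of that idea, with hypothetical names rather than an actual patch: the plugin state lives in plain local memory on every node and is refreshed from plugin events, so reading it never triggers a cache load, deserialization, or a remote call.

      import java.util.Map;
      import java.util.concurrent.ConcurrentHashMap;

      public class LocalPluginStateStore {

          // Always-present local copy. Reads never block on the cluster.
          private final Map<String, Boolean> enabledByPluginKey = new ConcurrentHashMap<>();

          // Called from plugin enable/disable event listeners (the issue notes we
          // already listen to plugin events), keeping the copy current asynchronously.
          public void onPluginEnabled(String pluginKey) {
              enabledByPluginKey.put(pluginKey, Boolean.TRUE);
          }

          public void onPluginDisabled(String pluginKey) {
              enabledByPluginKey.put(pluginKey, Boolean.FALSE);
          }

          // The hot path used during class loading: a plain map read, with no
          // cache load, no deserialization, and no remote partition lookup.
          public boolean isPluginEnabled(String pluginKey) {
              return enabledByPluginKey.getOrDefault(pluginKey, Boolean.FALSE);
          }
      }

      Reads always bottom out locally; cross-node updates become eventually consistent via the events, which should be acceptable for plugin state.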

            Dana Cleveland added a comment - I may have just run into this on 5.10.8

            sthanu added a comment - We are experiencing this issue in confluence 5.10.8 too

            David Rizzuto added a comment - Note that in Confluence 5.7 this will still be an issue, however it should happen much less frequently. In 5.7 we only use the UberClassLoader (and its serial nature) as a last resort, so most of the time the other threads won't have to block waiting for it.
