CONFSERVER-35930

Confluence Data Center nodes can get into a deadlocked state



    Description

      We had a situation on our production instance confluence.atlassian.com where two cluster nodes got into a deadlocked state. Fortunately, one of the other nodes remained unaffected and was able to continue serving requests, so it did not bring down the whole cluster.

      Here is the scenario:

      node1

      • All Hazelcast listener threads were attempting to process near cache entry invalidation messages.
      • To do this, they need to deserialize the cache key.
      • That, in turn, requires loading the key's class.
      • The ConfluenceUberClassLoader delegates to the PluginsClassLoader, and that class loading is synchronized on the classloader (see the sketch below).
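
      A minimal sketch of why that serializes the listener threads (UberClassLoaderSketch and PluginAwareLoader below are made-up stand-ins for ConfluenceUberClassLoader and PluginsClassLoader, not the real classes): java.lang.ClassLoader.loadClass takes the classloader instance's monitor for loaders that are not registered as parallel-capable, so while one thread's findClass is blocked, every other thread trying to load a key class queues behind the same lock.

      // Illustrative sketch only; not Confluence's actual classloader code.
      public class UberClassLoaderSketch extends ClassLoader {

          /** Hypothetical stand-in for PluginsClassLoader: it has to consult plugin
           *  state before it can pick a plugin's classloader. */
          public interface PluginAwareLoader {
              Class<?> loadFromEnabledPlugins(String name) throws ClassNotFoundException;
          }

          private final PluginAwareLoader pluginLoader;

          public UberClassLoaderSketch(PluginAwareLoader pluginLoader) {
              this.pluginLoader = pluginLoader;
          }

          @Override
          protected Class<?> findClass(String name) throws ClassNotFoundException {
              // loadClass(String) reaches this method while holding this loader's monitor
              // (the loader is not parallel-capable), so if this call blocks, every other
              // thread deserializing a cache key waits behind it.
              return pluginLoader.loadFromEnabledPlugins(name);
          }
      }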

      So by this point we have all the listener threads waiting for the class loading lock. Now:

      • One thread holds the lock and is attempting to load the class of the cache key to invalidate.
      • In order to fetch the correct classloader, it needs to check that the right plugin is enabled.
      • To check whether a plugin is enabled, it queries the PluginStateStore.
      • The CachingPluginStateStore is stored in a near cache, so this call is usually fast. However, in this case the near cache entry had just been invalidated.
      • So, in order to read the state of the plugin system, the PluginStateStore had to be reloaded from the distributed cache.
      • The cache partition it needed was owned by another node, so Hazelcast attempted to read it with a remote operation (see the sketch below).
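
      A rough sketch of the chain on that lock-holding thread, assuming made-up names (PluginStateLookupSketch and the "plugin.state" map name are illustrative; only HazelcastInstance.getMap and IMap.get are real Hazelcast 3.x API, and IMap.get blocks until the partition owner responds):

      import com.hazelcast.core.HazelcastInstance;
      import com.hazelcast.core.IMap;

      class PluginStateLookupSketch {

          private final IMap<String, Object> pluginStateCache;

          PluginStateLookupSketch(HazelcastInstance hazelcast) {
              // Hypothetical map name; the real store sits behind Confluence's caching layers.
              this.pluginStateCache = hazelcast.getMap("plugin.state");
          }

          boolean isPluginEnabled(String pluginKey) {
              // The near-cached copy has just been invalidated, so this get() becomes a
              // remote operation to the partition owner. The calling thread blocks here
              // while still holding the PluginsClassLoader lock from the previous step.
              Object systemState = pluginStateCache.get("plugin.system.state");
              return systemState != null; // real code would inspect the state for pluginKey
          }
      }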

      So by now we have all the threads on node1 waiting for the classloader, which in turn is waiting for the PluginStateStore, which is stored on another node, so the whole chain is waiting on a remote cache operation.

      The stack trace for the thread attempting the operation:

      "async-entry-listener-19" daemon prio=10 tid=0x00007f4098a90800 nid=0x7d26 in Object.wait() [0x00007f3ff39bc000]
         java.lang.Thread.State: TIMED_WAITING (on object monitor)
      	at java.lang.Object.wait(Native Method)
      	at com.hazelcast.spi.impl.BasicInvocationFuture.waitForResponse(BasicInvocationFuture.java:194)
      	- locked <0x000000077d63ea60> (a com.hazelcast.spi.impl.BasicInvocationFuture)
      	at com.hazelcast.spi.impl.BasicInvocationFuture.get(BasicInvocationFuture.java:166)
      	at com.hazelcast.spi.impl.BasicInvocationFuture.get(BasicInvocationFuture.java:146)
      	at com.hazelcast.map.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:337)
      	at com.hazelcast.map.proxy.MapProxySupport.getInternal(MapProxySupport.java:240)
      	at com.hazelcast.map.proxy.MapProxyImpl.get(MapProxyImpl.java:82)
      	at com.atlassian.confluence.cache.hazelcast.nearcache.NearCacheLoader.load(NearCacheLoader.java:25)
      	at com.atlassian.confluence.cache.hazelcast.monitoring.ConfluenceMonitoringLoader.load(ConfluenceMonitoringLoader.java:37)
      	at com.atlassian.cache.ehcache.EhCacheManager$3$1.createEntry(EhCacheManager.java:168)
      	at net.sf.ehcache.constructs.blocking.SelfPopulatingCache.get(SelfPopulatingCache.java:73)
      	at com.atlassian.cache.ehcache.DelegatingCache.get(DelegatingCache.java:74)
      	at com.atlassian.confluence.cache.ehcache.DefaultConfluenceEhCache.get(DefaultConfluenceEhCache.java:36)
      	at com.atlassian.confluence.cache.ConfluenceMonitoringCache.get(ConfluenceMonitoringCache.java:43)
      	at com.atlassian.confluence.cache.hazelcast.nearcache.ConfluenceNearCache.get(ConfluenceNearCache.java:57)
      	at com.atlassian.confluence.cache.ConfluenceMonitoringCache.get(ConfluenceMonitoringCache.java:43)
      	at com.atlassian.confluence.cache.TransactionalCacheFactory$TransactionAwareCache.get(TransactionalCacheFactory.java:335)
      	at com.atlassian.confluence.plugin.CachingPluginStateStore.load(CachingPluginStateStore.java:47)
      	at sun.reflect.GeneratedMethodAccessor62.invoke(Unknown Source)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:307)
      	at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:198)
      	at com.sun.proxy.$Proxy28.load(Unknown Source)
      	at com.atlassian.plugin.manager.DefaultPluginManager.getState(DefaultPluginManager.java:956)
      	at com.atlassian.plugin.manager.DefaultPluginManager.isPluginEnabled(DefaultPluginManager.java:1987)
      	at com.atlassian.plugin.classloader.PluginsClassLoader.isPluginEnabled(PluginsClassLoader.java:211)
      	at com.atlassian.plugin.classloader.PluginsClassLoader.findClass(PluginsClassLoader.java:104)
      	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
      	- locked <0x00000006b2cedc48> (a com.atlassian.plugin.classloader.PluginsClassLoader)
      	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
      	at com.atlassian.confluence.util.ConfluenceUberClassLoader.loadClass(ConfluenceUberClassLoader.java:83)
      	at com.hazelcast.nio.ClassLoaderUtil.tryLoadClass(ClassLoaderUtil.java:124)
      	at com.hazelcast.nio.ClassLoaderUtil.loadClass(ClassLoaderUtil.java:97)
      	at com.hazelcast.nio.IOUtil$1.resolveClass(IOUtil.java:98)
      	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
      	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
      	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
      	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
      	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
      	at com.hazelcast.nio.serialization.DefaultSerializers$ObjectSerializer.read(DefaultSerializers.java:196)
      	at com.hazelcast.nio.serialization.StreamSerializerAdapter.read(StreamSerializerAdapter.java:63)
      	at com.hazelcast.nio.serialization.SerializationServiceImpl.toObject(SerializationServiceImpl.java:261)
      	at com.hazelcast.map.DataAwareEntryEvent.getKey(DataAwareEntryEvent.java:62)
      	at com.atlassian.confluence.cache.hazelcast.nearcache.NearCacheEntryListener.invalidateIfNonLocal(NearCacheEntryListener.java:72)
      	at com.atlassian.confluence.cache.hazelcast.nearcache.NearCacheEntryListener.entryEvicted(NearCacheEntryListener.java:45)
      	at com.atlassian.confluence.cache.hazelcast.DefaultAsyncEntryListenerGenerator$CallDelegateTask.run(DefaultAsyncEntryListenerGenerator.java:187)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:744)
      
         Locked ownable synchronizers:
      	- <0x00000006b3bf55e8> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
      	- <0x00000006c69d7090> (a java.util.concurrent.ThreadPoolExecutor$Worker)
      

      node2
      node2's threads were all stuck trying to load the same cache value (from a Hibernate L2 cache) from a remote cache partition owned by node1. Since all of node1's Hazelcast listener threads were tied up waiting, node2's threads got stuck too, waiting for a response from node1. Deadlock.
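
      The shape of that failure can be reproduced in miniature with two single-thread pools standing in for the two nodes' listener threads (a self-contained illustration, not Confluence code): each pool's only worker blocks waiting on work that can only run on the other pool's worker, so neither side can ever make progress.

      import java.util.concurrent.*;

      public class CrossNodeDeadlockSketch {
          public static void main(String[] args) throws Exception {
              ExecutorService node1 = Executors.newSingleThreadExecutor();
              ExecutorService node2 = Executors.newSingleThreadExecutor();
              CountDownLatch bothBusy = new CountDownLatch(2);

              Future<String> fromNode1 = node1.submit(() -> {
                  bothBusy.countDown();
                  bothBusy.await();                     // wait until both workers are occupied
                  // node1's only worker now needs a value from node2 and blocks for it,
                  // like the listener thread blocking on the remote map get.
                  return node2.submit(() -> "plugin state").get();
              });
              Future<String> fromNode2 = node2.submit(() -> {
                  bothBusy.countDown();
                  bothBusy.await();
                  // node2's only worker likewise needs a value owned by node1.
                  return node1.submit(() -> "cache value").get();
              });

              // Neither result ever arrives: each pool's single worker is parked waiting
              // on the other, so this times out instead of completing.
              try {
                  fromNode1.get(2, TimeUnit.SECONDS);
              } catch (TimeoutException expected) {
                  System.out.println("deadlocked, as expected: " + fromNode2.isDone());
              }
              node1.shutdownNow();
              node2.shutdownNow();
          }
      }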

      What we can do about it:
      The crux of the problem is what is happening on node1:

      Cache Operation -> load class -> read plugin state -> remote cache operation

      This circular dependence on cache operations is what leaves us open to deadlocks. It is a particular problem because the PluginStateStore is needed for pretty much everything.

      oburn had a great idea: make the PluginStateStore NOT a regular cache, but an always-local data store. Have it listen to application events to keep itself up to date asynchronously (we do this anyway with plugin events). Ensure it always has a value, so reading the plugin system state is always reliable and fast. This would allow the threads on node1 to untangle themselves and "bottom out" at a local call rather than a remote one.
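
      A rough sketch of that shape, with made-up names (LocalPluginStateStoreSketch and onPluginStateChanged are illustrative, not the real Atlassian plugin or event APIs): plugin state lives in plain local memory, is updated asynchronously from plugin events, and every read stays inside the JVM, so it can never block on a remote node.

      import java.util.Map;
      import java.util.concurrent.ConcurrentHashMap;

      class LocalPluginStateStoreSketch {

          enum PluginState { ENABLED, DISABLED }

          // Always-populated local copy: reads never touch the distributed cache.
          private final Map<String, PluginState> states = new ConcurrentHashMap<String, PluginState>();

          /** Fast, purely local read, safe to call from class loading and everywhere else. */
          PluginState getState(String pluginKey) {
              PluginState state = states.get(pluginKey);
              return state != null ? state : PluginState.DISABLED;
          }

          /** Invoked asynchronously from plugin enable/disable events (the plugin event
           *  stream we already consume) so the local copy stays up to date. */
          void onPluginStateChanged(String pluginKey, PluginState newState) {
              states.put(pluginKey, newState);
          }
      }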

People

    psemeniuk Petro Semeniuk (Inactive)
    drizzuto David Rizzuto