CONFSERVER-35930

Confluence Data Center nodes can get into a deadlocked state



    Description

      We had a situation on our production instance confluence.atlassian.com where two cluster nodes got into a deadlocked state. Fortunately, one of the other nodes remained unaffected and was able to continue serving requests, so it did not bring down the whole cluster.

      Here is the scenario:

      node1

      • All Hazelcast listener threads were attempting to process near cache entry invalidation messages.
      • To do this, they need to deserialize the cache key.
      • That, in turn, requires loading the key's class.
      • The ConfluenceUberClassLoader delegates to the PluginsClassLoader, and that class loading is synchronized on the classloader (see the sketch below).
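
      A minimal sketch of why that serializes the listener threads (UberClassLoaderSketch and PluginAwareLoader below are made-up stand-ins for ConfluenceUberClassLoader and PluginsClassLoader, not the real classes): java.lang.ClassLoader.loadClass takes the classloader instance's monitor for loaders that are not registered as parallel-capable, so while one thread's findClass is blocked, every other thread trying to load a key class queues behind the same lock.

      // Illustrative sketch only; not Confluence's actual classloader code.
      public class UberClassLoaderSketch extends ClassLoader {

          /** Hypothetical stand-in for PluginsClassLoader: it has to consult plugin
           *  state before it can pick a plugin's classloader. */
          public interface PluginAwareLoader {
              Class<?> loadFromEnabledPlugins(String name) throws ClassNotFoundException;
          }

          private final PluginAwareLoader pluginLoader;

          public UberClassLoaderSketch(PluginAwareLoader pluginLoader) {
              this.pluginLoader = pluginLoader;
          }

          @Override
          protected Class<?> findClass(String name) throws ClassNotFoundException {
              // loadClass(String) reaches this method while holding this loader's monitor
              // (the loader is not parallel-capable), so if this call blocks, every other
              // thread deserializing a cache key waits behind it.
              return pluginLoader.loadFromEnabledPlugins(name);
          }
      }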

      So by this point we have all the listener threads waiting for the class loading lock. Now:

      • One thread holds the lock and is attempting to load the class of the cache key to invalidate.
      • In order to fetch the correct classloader, it needs to check that the right plugin is enabled.
      • To check whether a plugin is enabled, it queries the PluginStateStore.
      • The CachingPluginStateStore is stored in a near cache, so this call is usually fast. However, in this case the near cache entry had just been invalidated.
      • So, in order to read the state of the plugin system, the PluginStateStore had to be reloaded from the distributed cache.
      • The cache partition it needed was owned by another node, so Hazelcast attempted to read it with a remote operation (see the sketch below).
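
      A rough sketch of the chain on that lock-holding thread, assuming made-up names (PluginStateLookupSketch and the "plugin.state" map name are illustrative; only HazelcastInstance.getMap and IMap.get are real Hazelcast 3.x API, and IMap.get blocks until the partition owner responds):

      import com.hazelcast.core.HazelcastInstance;
      import com.hazelcast.core.IMap;

      class PluginStateLookupSketch {

          private final IMap<String, Object> pluginStateCache;

          PluginStateLookupSketch(HazelcastInstance hazelcast) {
              // Hypothetical map name; the real store sits behind Confluence's caching layers.
              this.pluginStateCache = hazelcast.getMap("plugin.state");
          }

          boolean isPluginEnabled(String pluginKey) {
              // The near-cached copy has just been invalidated, so this get() becomes a
              // remote operation to the partition owner. The calling thread blocks here
              // while still holding the PluginsClassLoader lock from the previous step.
              Object systemState = pluginStateCache.get("plugin.system.state");
              return systemState != null; // real code would inspect the state for pluginKey
          }
      }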

      So by now we have all the threads on node1 waiting for the classloader, which in turn is waiting for the PluginStateStore, which is stored on another node, so the whole chain is waiting on a remote cache operation.

      The stack trace for the thread attempting the operation:

      "async-entry-listener-19" daemon prio=10 tid=0x00007f4098a90800 nid=0x7d26 in Object.wait() [0x00007f3ff39bc000]
         java.lang.Thread.State: TIMED_WAITING (on object monitor)
      	at java.lang.Object.wait(Native Method)
      	at com.hazelcast.spi.impl.BasicInvocationFuture.waitForResponse(BasicInvocationFuture.java:194)
      	- locked <0x000000077d63ea60> (a com.hazelcast.spi.impl.BasicInvocationFuture)
      	at com.hazelcast.spi.impl.BasicInvocationFuture.get(BasicInvocationFuture.java:166)
      	at com.hazelcast.spi.impl.BasicInvocationFuture.get(BasicInvocationFuture.java:146)
      	at com.hazelcast.map.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:337)
      	at com.hazelcast.map.proxy.MapProxySupport.getInternal(MapProxySupport.java:240)
      	at com.hazelcast.map.proxy.MapProxyImpl.get(MapProxyImpl.java:82)
      	at com.atlassian.confluence.cache.hazelcast.nearcache.NearCacheLoader.load(NearCacheLoader.java:25)
      	at com.atlassian.confluence.cache.hazelcast.monitoring.ConfluenceMonitoringLoader.load(ConfluenceMonitoringLoader.java:37)
      	at com.atlassian.cache.ehcache.EhCacheManager$3$1.createEntry(EhCacheManager.java:168)
      	at net.sf.ehcache.constructs.blocking.SelfPopulatingCache.get(SelfPopulatingCache.java:73)
      	at com.atlassian.cache.ehcache.DelegatingCache.get(DelegatingCache.java:74)
      	at com.atlassian.confluence.cache.ehcache.DefaultConfluenceEhCache.get(DefaultConfluenceEhCache.java:36)
      	at com.atlassian.confluence.cache.ConfluenceMonitoringCache.get(ConfluenceMonitoringCache.java:43)
      	at com.atlassian.confluence.cache.hazelcast.nearcache.ConfluenceNearCache.get(ConfluenceNearCache.java:57)
      	at com.atlassian.confluence.cache.ConfluenceMonitoringCache.get(ConfluenceMonitoringCache.java:43)
      	at com.atlassian.confluence.cache.TransactionalCacheFactory$TransactionAwareCache.get(TransactionalCacheFactory.java:335)
      	at com.atlassian.confluence.plugin.CachingPluginStateStore.load(CachingPluginStateStore.java:47)
      	at sun.reflect.GeneratedMethodAccessor62.invoke(Unknown Source)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:307)
      	at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:198)
      	at com.sun.proxy.$Proxy28.load(Unknown Source)
      	at com.atlassian.plugin.manager.DefaultPluginManager.getState(DefaultPluginManager.java:956)
      	at com.atlassian.plugin.manager.DefaultPluginManager.isPluginEnabled(DefaultPluginManager.java:1987)
      	at com.atlassian.plugin.classloader.PluginsClassLoader.isPluginEnabled(PluginsClassLoader.java:211)
      	at com.atlassian.plugin.classloader.PluginsClassLoader.findClass(PluginsClassLoader.java:104)
      	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
      	- locked <0x00000006b2cedc48> (a com.atlassian.plugin.classloader.PluginsClassLoader)
      	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
      	at com.atlassian.confluence.util.ConfluenceUberClassLoader.loadClass(ConfluenceUberClassLoader.java:83)
      	at com.hazelcast.nio.ClassLoaderUtil.tryLoadClass(ClassLoaderUtil.java:124)
      	at com.hazelcast.nio.ClassLoaderUtil.loadClass(ClassLoaderUtil.java:97)
      	at com.hazelcast.nio.IOUtil$1.resolveClass(IOUtil.java:98)
      	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
      	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
      	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
      	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
      	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
      	at com.hazelcast.nio.serialization.DefaultSerializers$ObjectSerializer.read(DefaultSerializers.java:196)
      	at com.hazelcast.nio.serialization.StreamSerializerAdapter.read(StreamSerializerAdapter.java:63)
      	at com.hazelcast.nio.serialization.SerializationServiceImpl.toObject(SerializationServiceImpl.java:261)
      	at com.hazelcast.map.DataAwareEntryEvent.getKey(DataAwareEntryEvent.java:62)
      	at com.atlassian.confluence.cache.hazelcast.nearcache.NearCacheEntryListener.invalidateIfNonLocal(NearCacheEntryListener.java:72)
      	at com.atlassian.confluence.cache.hazelcast.nearcache.NearCacheEntryListener.entryEvicted(NearCacheEntryListener.java:45)
      	at com.atlassian.confluence.cache.hazelcast.DefaultAsyncEntryListenerGenerator$CallDelegateTask.run(DefaultAsyncEntryListenerGenerator.java:187)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:744)
      
         Locked ownable synchronizers:
      	- <0x00000006b3bf55e8> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
      	- <0x00000006c69d7090> (a java.util.concurrent.ThreadPoolExecutor$Worker)
      

      node2
      node2's threads were all stuck trying to load the same cache value (from a Hibernate L2 cache) from a remote cache partition owned by node1. Since all of node1's Hazelcast listener threads were tied up waiting, node2's threads got stuck too, waiting for a response from node1. Deadlock.
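
      The shape of that failure can be reproduced in miniature with two single-thread pools standing in for the two nodes' listener threads (a self-contained illustration, not Confluence code): each pool's only worker blocks waiting on work that can only run on the other pool's worker, so neither side can ever make progress.

      import java.util.concurrent.*;

      public class CrossNodeDeadlockSketch {
          public static void main(String[] args) throws Exception {
              ExecutorService node1 = Executors.newSingleThreadExecutor();
              ExecutorService node2 = Executors.newSingleThreadExecutor();
              CountDownLatch bothBusy = new CountDownLatch(2);

              Future<String> fromNode1 = node1.submit(() -> {
                  bothBusy.countDown();
                  bothBusy.await();                     // wait until both workers are occupied
                  // node1's only worker now needs a value from node2 and blocks for it,
                  // like the listener thread blocking on the remote map get.
                  return node2.submit(() -> "plugin state").get();
              });
              Future<String> fromNode2 = node2.submit(() -> {
                  bothBusy.countDown();
                  bothBusy.await();
                  // node2's only worker likewise needs a value owned by node1.
                  return node1.submit(() -> "cache value").get();
              });

              // Neither result ever arrives: each pool's single worker is parked waiting
              // on the other, so this times out instead of completing.
              try {
                  fromNode1.get(2, TimeUnit.SECONDS);
              } catch (TimeoutException expected) {
                  System.out.println("deadlocked, as expected: " + fromNode2.isDone());
              }
              node1.shutdownNow();
              node2.shutdownNow();
          }
      }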

      What we can do about it:
      The crux of the problem is what is happening on node1:

      Cache Operation -> load class -> read plugin state -> remote cache operation

      This circular dependence on cache operations is what leaves us open to deadlocks. It is a particular problem because the PluginStateStore is needed for pretty much everything.

      oburn had a great idea: make the PluginStateStore NOT a regular cache, but an always-local data store. Have it listen to application events to keep itself up to date asynchronously (we do this anyway with plugin events). Ensure it always has a value, so reading the plugin system state is always reliable and fast. This would allow the threads on node1 to untangle themselves and "bottom out" at a local call rather than a remote one.
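
      A rough sketch of that shape, with made-up names (LocalPluginStateStoreSketch and onPluginStateChanged are illustrative, not the real Atlassian plugin or event APIs): plugin state lives in plain local memory, is updated asynchronously from plugin events, and every read stays inside the JVM, so it can never block on a remote node.

      import java.util.Map;
      import java.util.concurrent.ConcurrentHashMap;

      class LocalPluginStateStoreSketch {

          enum PluginState { ENABLED, DISABLED }

          // Always-populated local copy: reads never touch the distributed cache.
          private final Map<String, PluginState> states = new ConcurrentHashMap<String, PluginState>();

          /** Fast, purely local read, safe to call from class loading and everywhere else. */
          PluginState getState(String pluginKey) {
              PluginState state = states.get(pluginKey);
              return state != null ? state : PluginState.DISABLED;
          }

          /** Invoked asynchronously from plugin enable/disable events (the plugin event
           *  stream we already consume) so the local copy stays up to date. */
          void onPluginStateChanged(String pluginKey, PluginState newState) {
              states.put(pluginKey, newState);
          }
      }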

People

    psemeniuk Petro Semeniuk (Inactive)
    drizzuto David Rizzuto