Uploaded image for project: 'Confluence Data Center'
  1. Confluence Data Center
  2. CONFSERVER-58181

Confluence cluster can become unresponsive when hazelcast must return information related to many labels

      Issue Summary

      Confluence can become unresponsive, waiting on hazelcast to return information relating to labels. This issue is only seen when using the Midori Better Archiving plugin on a Confluence cluster with ~70000 users.

      Environment

      Confluence 6.6.9 Data Center with 4 nodes
      Midori Better Archiving plugin v 8.1.0

      Steps to Reproduce

      The steps to reproduce are difficult to articulate, as the issue occurs while Confluence is under normal usage, albeit with a large number of users. jxie and vhu@atlassian.com have both looked at this previously, and they determined that when the issue is occurring, Hazelcast appears to be overwhelmed. It is believed that the plugin is overwhelming hazelcast with additional requests to the labels cache. Confluence already does a labels fetch on page load (as indicated by the threads similarly waiting on the hazelcast cache for labels information after the plugin had been disabled), but the midori plugin is at least doubling the number of requests to the cache for the same information by checking labels for every web item that might be displayed (the list provided to getDisplayableItems).

      It does seem like threads are being processed (moving from WAITING to RUNNABLE), but it's happening quite slowly, which could indicate Hazelcast is backed up or that some nodes holding label cache data are responsive while others aren't.

      Expected Results

      Confluence should remain stable while the Midori Better Archiving plugin is enabled, even when Confluence is under high load

      Actual Results

      Here are some sample threads that show the behavior we see when the issue is occurring:

      "http-nio-8090-exec-216 url:/display/EXAM/Page+Name username:charlie url:/pages/viewpage.action username:charlie" #84425 daemon prio=5 tid=0x00007effcc04a000 nid=0x1efee waiting on condition [0x00007efd98398000]
         java.lang.Thread.State: WAITING (parking)
      	at sun.misc.Unsafe.park(Native Method)
      	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304)
      	at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:153)
      	at com.hazelcast.map.impl.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:376)
      	at com.hazelcast.map.impl.proxy.MapProxySupport.getInternal(MapProxySupport.java:307)
      	at com.hazelcast.map.impl.proxy.NearCachedMapProxyImpl.getInternal(NearCachedMapProxyImpl.java:101)
      	at com.hazelcast.map.impl.proxy.MapProxyImpl.get(MapProxyImpl.java:112)
      	at com.atlassian.cache.hazelcast.HazelcastCache.getOrLoad(HazelcastCache.java:243)
      	at com.atlassian.cache.hazelcast.HazelcastCache.get(HazelcastCache.java:95)
      	at com.atlassian.confluence.cache.DefaultConfluenceCache.get(DefaultConfluenceCache.java:47)
      	at com.atlassian.confluence.cache.ConfluenceMonitoringCache.get(ConfluenceMonitoringCache.java:45)
      	at com.atlassian.confluence.cache.DefaultConfluenceCache.get(DefaultConfluenceCache.java:47)
      	at com.atlassian.confluence.cache.ConfluenceMonitoringCache.get(ConfluenceMonitoringCache.java:45)
      	at com.atlassian.confluence.cache.hazelcast.hibernate.HazelcastHibernateCache.get(HazelcastHibernateCache.java:32)
      	at com.atlassian.hibernate.extras.tangosol.CoherenceCacheStrategy.get(CoherenceCacheStrategy.java:68)
      	at com.atlassian.confluence.cache.hibernate.SwitchableCacheConcurrencyStrategy.get(SwitchableCacheConcurrencyStrategy.java:29)
      	at com.atlassian.confluence.cache.hibernate.ConfluenceCacheStrategy.get(ConfluenceCacheStrategy.java:49)
      	at com.atlassian.hibernate.adapter.adapters.cache.EntityRegionAccessStrategyV5Adapter.get(EntityRegionAccessStrategyV5Adapter.java:51)
      	at org.hibernate.engine.internal.CacheHelper.fromSharedCache(CacheHelper.java:32)
      	at org.hibernate.event.internal.DefaultLoadEventListener.getFromSharedCache(DefaultLoadEventListener.java:651)
      	at org.hibernate.event.internal.DefaultLoadEventListener.loadFromSecondLevelCache(DefaultLoadEventListener.java:595)
      	at org.hibernate.event.internal.DefaultLoadEventListener.doLoad(DefaultLoadEventListener.java:462)
      	at org.hibernate.event.internal.DefaultLoadEventListener.load(DefaultLoadEventListener.java:219)
      	...
      	at com.atlassian.confluence.util.LabelUtil.extractLabelsFromLabellings(LabelUtil.java:625)
      	at com.atlassian.confluence.core.AbstractLabelableEntityObject.getLabels(AbstractLabelableEntityObject.java:26)
      	...
      	at com.atlassian.confluence.spaces.SpaceDescription_$$_jvstabe_15.getLabels(SpaceDescription_$$_jvstabe_15.java)
      	at com.midori.confluence.plugin.archiving.service.support.ContentLifecycleLabelHelper.findSpaceLabel(ContentLifecycleLabelHelper.java:156)
      	at com.midori.confluence.plugin.archiving.service.support.ContentLifecycleLabelHelper.isBlacklistedSpace(ContentLifecycleLabelHelper.java:112)
      	at com.midori.confluence.plugin.archiving.util.NonBlacklistedGlobalFreshSpacePredicate.apply(NonBlacklistedGlobalFreshSpacePredicate.java:15)
      	at com.midori.confluence.plugin.archiving.web.condition.NonBlacklistedGlobalFreshSpaceCondition.shouldDisplay(NonBlacklistedGlobalFreshSpaceCondition.java:15)
      	at com.atlassian.confluence.plugin.descriptor.web.conditions.BaseConfluenceCondition.shouldDisplay(BaseConfluenceCondition.java:33)
      	at com.atlassian.plugin.web.conditions.AndCompositeCondition.shouldDisplay(AndCompositeCondition.java:14)
      	at com.atlassian.plugin.web.conditions.AndCompositeCondition.shouldDisplay(AndCompositeCondition.java:14)
      	at com.atlassian.plugin.web.DefaultWebInterfaceManager.filterFragmentsByCondition(DefaultWebInterfaceManager.java:154)
      	at com.atlassian.plugin.web.DefaultWebInterfaceManager.getDisplayableItems(DefaultWebInterfaceManager.java:107)
      	at com.atlassian.confluence.plugin.descriptor.web.ConfluenceWebInterfaceManager.getDisplayableItems(ConfluenceWebInterfaceManager.java:74)
      	...
      	at com.atlassian.confluence.plugins.pagebanner.PageBannerContextProvider.getBannerItems(PageBannerContextProvider.java:72)
      	at com.atlassian.confluence.plugins.pagebanner.PageBannerContextProvider.getContextMap(PageBannerContextProvider.java:57)
      	at com.atlassian.plugin.web.descriptors.DefaultWebPanelModuleDescriptor$ContextAwareWebPanel.getHtml(DefaultWebPanelModuleDescriptor.java:144)
      	...
      	at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:684)
      	at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1539)
      	at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1495)
      	- locked <0x00007f03be400550> (a org.apache.tomcat.util.net.NioChannel)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
      	at java.lang.Thread.run(Thread.java:748)
      "http-nio-8090-exec-203 url:/display/EXAM/Page+Name username:charlie url:/pages/viewpage.action username:charlie" #14472 daemon prio=5 tid=0x00007ffac00fd800 nid=0x1c1e4 waiting on condition [0x00007fec4e98d000]
         java.lang.Thread.State: WAITING (parking)
      	at sun.misc.Unsafe.park(Native Method)
      	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304)
      	at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:153)
      	at com.hazelcast.map.impl.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:376)
      	at com.hazelcast.map.impl.proxy.MapProxySupport.getInternal(MapProxySupport.java:307)
      	at com.hazelcast.map.impl.proxy.NearCachedMapProxyImpl.getInternal(NearCachedMapProxyImpl.java:101)
      	at com.hazelcast.map.impl.proxy.MapProxyImpl.get(MapProxyImpl.java:112)
      	at com.atlassian.cache.hazelcast.HazelcastCache.getOrLoad(HazelcastCache.java:243)
      	at com.atlassian.cache.hazelcast.HazelcastCache.get(HazelcastCache.java:95)
      	at com.atlassian.confluence.cache.DefaultConfluenceCache.get(DefaultConfluenceCache.java:47)
      	at com.atlassian.confluence.cache.ConfluenceMonitoringCache.get(ConfluenceMonitoringCache.java:45)
      	at com.atlassian.confluence.cache.DefaultConfluenceCache.get(DefaultConfluenceCache.java:47)
      	at com.atlassian.confluence.cache.ConfluenceMonitoringCache.get(ConfluenceMonitoringCache.java:45)
      	at com.atlassian.confluence.cache.hazelcast.hibernate.HazelcastHibernateCache.get(HazelcastHibernateCache.java:32)
      	at com.atlassian.hibernate.extras.tangosol.CoherenceCacheStrategy.get(CoherenceCacheStrategy.java:68)
      	at com.atlassian.confluence.cache.hibernate.SwitchableCacheConcurrencyStrategy.get(SwitchableCacheConcurrencyStrategy.java:29)
      	at com.atlassian.confluence.cache.hibernate.ConfluenceCacheStrategy.get(ConfluenceCacheStrategy.java:49)
      	at com.atlassian.hibernate.adapter.adapters.cache.EntityRegionAccessStrategyV5Adapter.get(EntityRegionAccessStrategyV5Adapter.java:51)
      	at org.hibernate.engine.internal.CacheHelper.fromSharedCache(CacheHelper.java:32)
      	at org.hibernate.event.internal.DefaultLoadEventListener.getFromSharedCache(DefaultLoadEventListener.java:651)
      	at org.hibernate.event.internal.DefaultLoadEventListener.loadFromSecondLevelCache(DefaultLoadEventListener.java:595)
      	at org.hibernate.event.internal.DefaultLoadEventListener.doLoad(DefaultLoadEventListener.java:462)
      	at org.hibernate.event.internal.DefaultLoadEventListener.load(DefaultLoadEventListener.java:219)
      	...
      	at com.atlassian.confluence.user.ConfluenceUserImpl_$$_jvstb24_3.getName(ConfluenceUserImpl_$$_jvstb24_3.java)
      	at com.atlassian.confluence.labels.Label.getOwner(Label.java:90)
      	at com.atlassian.confluence.labels.LabelParser.create(LabelParser.java:189)
      	at com.atlassian.confluence.labels.LabelParser.render(LabelParser.java:193)
      	at com.atlassian.confluence.labels.Label.getDisplayTitle(Label.java:170)
      	at com.atlassian.confluence.util.LabelUtil.extractLabelsFromLabellings(LabelUtil.java:626)
      	at com.atlassian.confluence.core.AbstractLabelableEntityObject.getLabels(AbstractLabelableEntityObject.java:26)
      	...
      	at com.atlassian.confluence.spaces.SpaceDescription_$$_jvstb24_15.getLabels(SpaceDescription_$$_jvstb24_15.java)
      	at com.atlassian.confluence.userstatus.DefaultFavouriteManager.isUserFavourite(DefaultFavouriteManager.java:48)
      	...
      	at com.atlassian.confluence.plugins.ia.ui.DefaultSoySidebarContextProvider.populateWithMainSidebarData(DefaultSoySidebarContextProvider.java:201)
      	at com.atlassian.confluence.plugins.ia.ui.DefaultSoySidebarContextProvider.getContextMap(DefaultSoySidebarContextProvider.java:116)
      	...
      	at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:684)
      	at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1539)
      	at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1495)
      	- locked <0x00007ff290402f40> (a org.apache.tomcat.util.net.NioChannel)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
      	at java.lang.Thread.run(Thread.java:748)
      "http-nio-8090-exec-220 url:/display/EXAM/Page+Name username:charlie url:/pages/viewpage.action username:charlie" #84429 daemon prio=5 tid=0x00007effcc163800 nid=0x1eff2 runnable [0x00007efd8dfa9000]
         java.lang.Thread.State: RUNNABLE
      	at java.lang.System.identityHashCode(Native Method)
      	at java.util.IdentityHashMap.hash(IdentityHashMap.java:295)
      	at java.util.IdentityHashMap.get(IdentityHashMap.java:329)
      	at org.hibernate.engine.internal.EntityEntryContext.getAssociatedManagedEntity(EntityEntryContext.java:177)
      	at org.hibernate.engine.internal.EntityEntryContext.addEntityEntry(EntityEntryContext.java:91)
      	at org.hibernate.engine.internal.StatefulPersistenceContext.addEntry(StatefulPersistenceContext.java:544)
      	at org.hibernate.engine.internal.StatefulPersistenceContext.addEntity(StatefulPersistenceContext.java:471)
      	at org.hibernate.engine.internal.TwoPhaseLoad.addUninitializedCachedEntity(TwoPhaseLoad.java:385)
      	at org.hibernate.event.internal.DefaultLoadEventListener.convertCacheEntryToEntity(DefaultLoadEventListener.java:738)
      	at org.hibernate.event.internal.DefaultLoadEventListener.processCachedEntry(DefaultLoadEventListener.java:625)
      	at org.hibernate.event.internal.DefaultLoadEventListener.loadFromSecondLevelCache(DefaultLoadEventListener.java:602)
      	at org.hibernate.event.internal.DefaultLoadEventListener.doLoad(DefaultLoadEventListener.java:462)
      	at org.hibernate.event.internal.DefaultLoadEventListener.load(DefaultLoadEventListener.java:219)
      	at org.hibernate.event.internal.DefaultLoadEventListener.doOnLoad(DefaultLoadEventListener.java:116)
      	at org.hibernate.event.internal.DefaultLoadEventListener.onLoad(DefaultLoadEventListener.java:89)
      	at org.hibernate.internal.SessionImpl.fireLoad(SessionImpl.java:1230)
      	at org.hibernate.internal.SessionImpl.immediateLoad(SessionImpl.java:1088)
      	at org.hibernate.proxy.AbstractLazyInitializer.initialize(AbstractLazyInitializer.java:155)
      	at org.hibernate.proxy.AbstractLazyInitializer.getImplementation(AbstractLazyInitializer.java:259)
      	at org.hibernate.proxy.pojo.javassist.JavassistLazyInitializer.invoke(JavassistLazyInitializer.java:73)
      	at com.atlassian.hibernate.adapter.proxy.JavassistLazyInitializer_ImplementV2Proxy.invoke(JavassistLazyInitializer_ImplementV2Proxy.java:68)
      	at com.atlassian.confluence.user.ConfluenceUserImpl_$$_jvstabe_3.getName(ConfluenceUserImpl_$$_jvstabe_3.java)
      	at com.atlassian.confluence.labels.Label.getOwner(Label.java:90)
      	at com.atlassian.confluence.labels.LabelParser.create(LabelParser.java:189)
      	at com.atlassian.confluence.labels.LabelParser.render(LabelParser.java:193)
      	at com.atlassian.confluence.labels.Label.getDisplayTitle(Label.java:170)
      	at com.atlassian.confluence.util.LabelUtil.extractLabelsFromLabellings(LabelUtil.java:626)
      	at com.atlassian.confluence.core.AbstractLabelableEntityObject.getLabels(AbstractLabelableEntityObject.java:26)
      	...
      	at com.atlassian.confluence.spaces.SpaceDescription_$$_jvstabe_15.getLabels(SpaceDescription_$$_jvstabe_15.java)
      	at com.midori.confluence.plugin.archiving.service.support.ContentLifecycleLabelHelper.findSpaceLabel(ContentLifecycleLabelHelper.java:156)
      	at com.midori.confluence.plugin.archiving.service.support.ContentLifecycleLabelHelper.isBlacklistedSpace(ContentLifecycleLabelHelper.java:112)
      	at com.midori.confluence.plugin.archiving.util.NonBlacklistedGlobalFreshSpacePredicate.apply(NonBlacklistedGlobalFreshSpacePredicate.java:15)
      	at com.midori.confluence.plugin.archiving.web.condition.NonBlacklistedGlobalFreshSpaceCondition.shouldDisplay(NonBlacklistedGlobalFreshSpaceCondition.java:15)
      	at com.atlassian.confluence.plugin.descriptor.web.conditions.BaseConfluenceCondition.shouldDisplay(BaseConfluenceCondition.java:33)
      	at com.atlassian.plugin.web.conditions.AndCompositeCondition.shouldDisplay(AndCompositeCondition.java:14)
      	at com.atlassian.plugin.web.conditions.AndCompositeCondition.shouldDisplay(AndCompositeCondition.java:14)
      	at com.atlassian.plugin.web.DefaultWebInterfaceManager.filterFragmentsByCondition(DefaultWebInterfaceManager.java:154)
      	at com.atlassian.plugin.web.DefaultWebInterfaceManager.getDisplayableItems(DefaultWebInterfaceManager.java:107)
      	at com.atlassian.confluence.plugin.descriptor.web.ConfluenceWebInterfaceManager.getDisplayableItems(ConfluenceWebInterfaceManager.java:74)
      	...
      	at com.atlassian.confluence.plugins.pagebanner.PageBannerContextProvider.getBannerItems(PageBannerContextProvider.java:72)
      	at com.atlassian.confluence.plugins.pagebanner.PageBannerContextProvider.getContextMap(PageBannerContextProvider.java:57)
      	at com.atlassian.plugin.web.descriptors.DefaultWebPanelModuleDescriptor$ContextAwareWebPanel.getHtml(DefaultWebPanelModuleDescriptor.java:144)
      	...
      	at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:684)
      	at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1539)
      	at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1495)
      	- locked <0x00007f03af801258> (a org.apache.tomcat.util.net.NioChannel)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
      	at java.lang.Thread.run(Thread.java:748)

      Notes

      Is it possible that this is a bug in hazelcast? Perhaps https://github.com/hazelcast/hazelcast/issues/13170

      Workaround

      Disable Midori Better Archiving

          Form Name

            [CONFSERVER-58181] Confluence cluster can become unresponsive when hazelcast must return information related to many labels

            Oh, there is one more important note: as explained above, our app checks the labels of the space which contains the viewed page. It may not be redundant at all, because other mechanisms (I assume) check the labels of the viewed page itself.

            Aron Gombas [Midori] added a comment - Oh, there is one more important note: as explained above, our app checks the labels of the space which contains the viewed page. It may not be redundant at all, because other mechanisms (I assume) check the labels of the viewed page itself.

            My name is Aron Gombas. I am the lead developer of the Better Content Archiving app.

            First, let me explain why we need to check the labels that many times.
            As you may know, the Content Status Indicator is displayed in the top of all pages that are in a global space which is not in the blacklist. That means, that when the page is opened, Confluence calls our condition that decides whether to display that page fragment. The condition gets the space in which the current page is located, then gets its labels and checks if "noarchive" is among those (which would mean that space is blacklisted). This call sequence can also be seen in your stack trace.

            So the label checks cannot be "saved" and cannot be made less frequently.

            One thing we could do is introducing some sort of caching mechanism, but this is counter-intuitive. If Confluence caches the labels and the cache is distributed in the cluster, then adding our secondary cache would just introduce new sources of problems (what if the cache consistency is lost anywhere?) and unnecessary complexity. I really think that it should be Confluence Data Center's responsibility to guarantee that the cache implementation is efficient.

            If there is anything we can do in our side, let me know.

             

            Aron Gombas [Midori] added a comment - My name is Aron Gombas. I am the lead developer of the Better Content Archiving app. First, let me explain why we need to check the labels that many times. As you may know, the Content Status Indicator is displayed in the top of all pages that are in a global space which is not in the blacklist. That means, that when the page is opened, Confluence calls our condition that decides whether to display that page fragment. The condition gets the space in which the current page is located, then gets its labels and checks if "noarchive" is among those (which would mean that space is blacklisted). This call sequence can also be seen in your stack trace. So the label checks cannot be "saved" and cannot be made less frequently. One thing we could do is introducing some sort of caching mechanism, but this is counter-intuitive. If Confluence caches the labels and the cache is distributed in the cluster, then adding our secondary cache would just introduce new sources of problems (what if the cache consistency is lost anywhere?) and unnecessary complexity. I really think that it should be Confluence Data Center's responsibility to guarantee that the cache implementation is efficient. If there is anything we can do in our side, let me know.  

              Unassigned Unassigned
              jkennemer Jonathon
              Affected customers:
              4 This affects my team
              Watchers:
              11 Start watching this issue

                Created:
                Updated: