Jira Data Center / JRASERVER-64908

UPM actions may flush internal caches, leading to performance problems

    • Severity 1 - Critical
      Atlassian Update – 26 April 2019

      Dear Jira users,

      We’re glad to announce that this issue will be addressed in our upcoming 8.2.0 release!

      We’ve fixed several issues that caused performance degradation in large Jira instances when installing or updating apps. If you’ve encountered this problem before, it’s solved, so feel free to update your apps anytime. For technical details, please see the "Fix description" comment.

      Looking forward to your comments.

      Thank you,
      Grazyna Kaszkur

      Product Manager,
      Jira Server and Data Center


      Summary

      During plugin (add-on) install, update (including license updates), delete, or disable actions, UPM flushes a large number of caches. Under high load this causes significant JIRA performance degradation and might cause JIRA to freeze or stall for a long time.

      Steps to Reproduce

      1. Apply high load to JIRA
      2. Upgrade any plugin.

      Expected Results

      • JIRA stays responsive, with at most a slight performance degradation.
      • JIRA admin is notified about the impact.

      Actual Results

      • JIRA admin is not notified about the impact
      • A thread dump might show an UpmAsynchronousTaskManager:thread thread parked on a cache lock:
        "UpmAsynchronousTaskManager:thread-1" #38891 prio=5 os_prio=0 tid=0x00007fbda8047800 nid=0x6712 waiting on condition [0x00007fbcf9cd5000]
           java.lang.Thread.State: WAITING (parking)
        	at sun.misc.Unsafe.park(Native Method)
        	- parking to wait for  <0x000000056c734418> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
        ...
        	at net.sf.ehcache.concurrent.ReadWriteLockSync.lock(ReadWriteLockSync.java:50)
        	at com.atlassian.cache.ehcache.LoadingCache.remove(LoadingCache.java:190)
        	at com.atlassian.cache.ehcache.DelegatingCachedReference.reset(DelegatingCachedReference.java:79)
        	at com.atlassian.jira.issue.search.managers.DefaultSearchHandlerManager.refresh(DefaultSearchHandlerManager.java:290)
        	at com.atlassian.jira.issue.search.managers.DefaultIssueSearcherManager.refresh(DefaultIssueSearcherManager.java:46)
        	at com.atlassian.jira.issue.fields.DefaultFieldManager.refreshSearchersAndIndexers(DefaultFieldManager.java:804)
        	at com.atlassian.jira.issue.fields.DefaultFieldManager.refresh(DefaultFieldManager.java:706)
        	at com.atlassian.jira.plugin.JiraCacheResetter$Delegate.resetCaches(JiraCacheResetter.java:83)
        	at com.atlassian.jira.plugin.JiraCacheResetter$Delegate.onPluginModuleDisabled(JiraCacheResetter.java:53)
        ...
        	at com.atlassian.plugin.event.impl.DefaultPluginEventManager.broadcast(DefaultPluginEventManager.java:73)
        	at com.atlassian.plugin.manager.DefaultPluginManager.broadcastIgnoreError(DefaultPluginManager.java:2081)
        	at com.atlassian.plugin.manager.DefaultPluginManager.publishModuleDisabledEvents(DefaultPluginManager.java:1955)
        	at com.atlassian.plugin.manager.DefaultPluginManager.disablePluginModuleNoPersist(DefaultPluginManager.java:1910)
        	at com.atlassian.plugin.manager.DefaultPluginManager.disablePluginModules(DefaultPluginManager.java:1904)
        	at com.atlassian.plugin.manager.DefaultPluginManager.disablePluginWithModuleEvents(DefaultPluginManager.java:1860)
        	at com.atlassian.plugin.manager.DefaultPluginManager.notifyPluginDisabled(DefaultPluginManager.java:1888)
        	at com.atlassian.plugin.manager.DefaultPluginManager.unloadPlugin(DefaultPluginManager.java:1022)
        	at com.atlassian.plugin.manager.DefaultPluginManager.uninstallNoEvent(DefaultPluginManager.java:981)
        	at com.atlassian.plugin.manager.DefaultPluginManager.updatePlugin(DefaultPluginManager.java:1343)
        	at com.atlassian.plugin.manager.DefaultPluginManager.addPlugins(DefaultPluginManager.java:1133)
        	at com.atlassian.jira.plugin.JiraPluginManager.addPlugins(JiraPluginManager.java:150)
        	at com.atlassian.plugin.manager.DefaultPluginManager.scanForNewPlugins(DefaultPluginManager.java:903)
        	at com.atlassian.plugin.manager.DefaultPluginManager.installPlugins(DefaultPluginManager.java:821)
        	at com.atlassian.jira.plugin.JiraPluginManager.installPlugins(JiraPluginManager.java:168)
        ...
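
To check a thread dump for this symptom, a small scan along the following lines could be used. This is a sketch only: the class UpmThreadDumpScanner and its method are hypothetical and not part of Jira or UPM. The thread name prefix and the stack frame are taken from the dump above, and the sketch assumes each thread block starts with a line that begins with a double quote.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Jira or UPM. */
class UpmThreadDumpScanner {

    // Thread name prefix as it appears in the thread dump above.
    private static final String UPM_THREAD_PREFIX = "\"UpmAsynchronousTaskManager:thread";

    /**
     * Returns the header line of every UPM task thread whose stack contains the given frame.
     * Assumption: a thread block starts with a line beginning with a double quote.
     */
    static List<String> findBlockedUpmThreads(String dump, String frame) {
        List<String> hits = new ArrayList<>();
        String header = null;
        boolean alreadyRecorded = false;
        for (String line : dump.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.startsWith("\"")) {
                // A new thread block begins.
                header = trimmed;
                alreadyRecorded = false;
            } else if (header != null
                    && !alreadyRecorded
                    && header.startsWith(UPM_THREAD_PREFIX)
                    && trimmed.contains(frame)) {
                hits.add(header);
                alreadyRecorded = true;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Abbreviated excerpt of the dump shown in this report.
        String dump = String.join("\n",
                "\"UpmAsynchronousTaskManager:thread-1\" #38891 prio=5 os_prio=0",
                "   java.lang.Thread.State: WAITING (parking)",
                "\tat com.atlassian.cache.ehcache.LoadingCache.remove(LoadingCache.java:190)",
                "\tat com.atlassian.jira.plugin.JiraCacheResetter$Delegate.resetCaches(JiraCacheResetter.java:83)");
        for (String header : findBlockedUpmThreads(dump, "JiraCacheResetter$Delegate.resetCaches")) {
            System.out.println("UPM thread inside a cache reset: " + header);
        }
    }
}
```

A match only indicates that a UPM task was inside a cache reset when the dump was taken. Several dumps taken a few seconds apart give a better picture of whether the thread is actually stuck.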
        

      Notes

      The problem is amplified on JIRA Data Center. A cache flush is also triggered during node startup, as UPM needs to register and load modules from plugins. See JRASERVER-66839.

      Workaround

      Please plan any UPM actions (plugin update, delete, or disable) for off-peak hours or maintenance windows. See Best Practices for Managing JIRA Application Add-ons for recommendations.
       

        1. plugin_disable_80.png (341 kB)
        2. plugin_disable_82.png (133 kB)

            [JRASERVER-64908] UPM actions may flush internal caches, leading to performance problems

            Hi,

            Thank you for the recent comments.

            We want to inform you that we have decided not to backport this fix to the 7.13 Enterprise Release version. Although we have been working on the backport in recent months, after careful analysis we ultimately feel that the change would bring too much risk to the stability of the release.

            We have made this tough call because the changes introduced in 8.2 were complex and scattered over many libraries and codebases. Additionally, backporting this particular change over the platform release involved many moving parts. All this made us realise that the risk is far too severe and that going forward with the backport might cause even more problems.

            We are aware that you were waiting for this update and that our decision may be disappointing to you. We would like to reassure you that we will keep backporting crucial bugs to our long term support versions, as we have stated here: https://confluence.atlassian.com/enterprise/atlassian-enterprise-releases-948227420.html. In some cases, however, a backport simply brings more risk than benefit, which we hadn’t originally anticipated.

            If you are impacted by this bug, we recommend upgrading to our newest Enterprise release, version 8.5.x. Please follow our upgrade guide for Jira 8.5: https://confluence.atlassian.com/jiracore/jira-8-5-enterprise-release-upgrade-guide-976161392.html.

            Best,

            Jira Server and Data Center product management

            Grazyna Kaszkur added the comment above.

            ivan.balandzin,
            thanks for asking. We're currently testing this backport. It's not a trivial fix, so unfortunately it takes time.

            Kamil Cichy (Inactive) added the comment above.

            Hello gkaszkur,
            Any news on the backport of the fix to 7.13.* since November? A month has passed...

            Ivan Balandzin added the comment above.

            Hi richard.carini,

            thank you for your question. I can confirm that the team is working on backporting the fix to 7.13.

            Unfortunately, I cannot provide an exact date or fix version, because it’s a considerable chunk of work that touches multiple projects. Moreover, the 7.13 line is an Enterprise Release, so we need to take extra care that the changes don’t cause any regressions, and testing will have to be more rigorous.

            Please stay tuned and watch this ticket for further updates. 

            Thank you, 

            Grażyna

             

            Grazyna Kaszkur added the comment above.

            Hi gkaszkur,

            Any news regarding which 7.13.x ER this is getting into? I need to schedule a 7.13 update and would like to wait for this fix.

            Thanks again!
            Rick

            Rick Carini added the comment above.

            Hey baldin1697836153
            I wanted to confirm that we are aiming to backport this fix to the 7.13 Enterprise Release; however, we do not have a specific 7.13 ER version yet. This is because we have decided to extend the soaking period for this fix as a stability precaution before the backport.
            Please expect another update from our side around October or early November 2019.

            Thank you.

            Grazyna Kaszkur added the comment above.

            Has this been backported to a 7.13 ER?

            Dom Baldin [Adobe] added the comment above.

            Will you also be backporting to the other enterprise version, 7.6.x?

            Is there a similar issue for Confluence, or does the difference in how caching is done in each product mean that this is a non-issue there?

            Thanks!

             

             

            Rick Carini added the comment above.

            Fix description

            Problem

            There is a pattern across Jira where caches are cleared when there is a change in the plugin system. A single user action on the plugin system (installing, enabling, disabling, or updating a plugin) could trigger thousands of plugin events, each of which could result in clearing some caches. This, combined with incoming user traffic that triggered cache loading, seriously affected the performance of both the plugin action and the user request time. There was also a problem with disabling plugins in an incorrect order. Because of plugin dependencies and plugins providing content to extension points, this could seriously affect page load time. In extreme situations, a plugin action affecting many modules could affect the stability of Jira for a long time.

            Jira 8.0: Example of disabling a multi-module plugin with dependencies on Jira Server, 1M issues, 400 concurrent users (image not reproduced here; likely the attachment plugin_disable_80.png).

            Solution

            1. We have limited the number of plugin events that are used for clearing caches. A user action on the plugin system, potentially triggering thousands of plugin events, is wrapped into a "transaction", and caches are now cleared based on just a few transactions triggered by this action.
            2. In Jira DC, this new transaction event that triggers the clearing of caches is always run with cache replication turned off. Therefore, caches are not unnecessarily cleared multiple times on each node.
            3. We have fixed the order of shutting down plugins to respect dependencies.
            4. There is now a plugin shutdown timeout (500 ms), required when shutting down plugins with cyclic dependencies.

            With these fixes, the same long-running operations from before now have very little effect on response times, and the time to perform plugin operations is much shorter.
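
The idea behind the first point of the solution can be sketched as follows. This is an illustration under stated assumptions, not Jira's implementation: the class CoalescingCacheResetter and its methods are hypothetical names, and the real fix spans the plugin framework and several other codebases. The sketch only shows how wrapping many plugin events in one transaction turns thousands of cache clears into one.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch; the names are illustrative and are not Jira's API. */
class CoalescingCacheResetter {
    private final Runnable clearCaches;
    private boolean inTransaction;
    private int eventsInTransaction;

    CoalescingCacheResetter(Runnable clearCaches) {
        this.clearCaches = clearCaches;
    }

    /** Marks the start of one user action on the plugin system. */
    synchronized void transactionStart() {
        inTransaction = true;
        eventsInTransaction = 0;
    }

    /** Called for every plugin event (module enabled, module disabled, and so on). */
    synchronized void onPluginEvent() {
        if (inTransaction) {
            // Inside a transaction: only count the event and defer the clear.
            eventsInTransaction++;
        } else {
            // Outside a transaction: every event clears caches, as before the fix.
            clearCaches.run();
        }
    }

    /** Ends the transaction, clears caches once if needed, and returns the event count. */
    synchronized int transactionEnd() {
        if (!inTransaction) {
            throw new IllegalStateException("transactionEnd() without transactionStart()");
        }
        inTransaction = false;
        int events = eventsInTransaction;
        if (events > 0) {
            clearCaches.run();
        }
        return events;
    }

    public static void main(String[] args) {
        AtomicInteger clears = new AtomicInteger();
        CoalescingCacheResetter resetter = new CoalescingCacheResetter(clears::incrementAndGet);

        resetter.transactionStart();
        for (int i = 0; i < 6007; i++) {
            resetter.onPluginEvent();
        }
        int events = resetter.transactionEnd();

        System.out.println(events + " plugin events, " + clears.get() + " cache clear(s)");
    }
}
```

The event count of 6007 in main is borrowed from the DEBUG log example in this comment. Nesting of transactions and error handling are left out for brevity.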

            Jira 8.2: Example of disabling a multi-module plugin with dependencies on Jira Server, 1M issues, 400 concurrent users (image not reproduced here; likely the attachment plugin_disable_82.png).

            New logs

            INFO

            INFO [plugin-transaction] numberStartEvents:111, numberEndEvents:111, numberSendEvents:111, numberEventsInTransactions:39942 

            This is logged every 5 minutes and shows statistics related to the new plugin transaction event.

            • numberSendEvents - can be read as the number of triggered cache clears
            • numberEventsInTransactions - can be read as the number of plugin events that were part of the transactions (i.e. events that would each have triggered a cache clear before the fix)

            In this example, we can see that since this Jira instance started, cache-clearing events were triggered 111 times instead of 39942.
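
To read these statistics from a log file, a parser along the following lines could be used. It is a sketch: the class PluginTransactionLogStats is hypothetical, and it assumes the INFO line keeps the `name:value` layout shown above. For the example line it works out to roughly 360 plugin events per cache clear (39942 / 111).

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not shipped with Jira. */
class PluginTransactionLogStats {

    // Assumption: counters appear as "numberXxx:123", as in the INFO line above.
    private static final Pattern FIELD = Pattern.compile("(number\\w+):(\\d+)");

    /** Extracts all "numberXxx:value" counters from one [plugin-transaction] log line. */
    static Map<String, Long> parse(String logLine) {
        Map<String, Long> stats = new LinkedHashMap<>();
        Matcher matcher = FIELD.matcher(logLine);
        while (matcher.find()) {
            stats.put(matcher.group(1), Long.parseLong(matcher.group(2)));
        }
        return stats;
    }

    /** Average number of plugin events folded into one cache clear. */
    static double eventsPerCacheClear(Map<String, Long> stats) {
        long sent = stats.getOrDefault("numberSendEvents", 0L);
        long inTransactions = stats.getOrDefault("numberEventsInTransactions", 0L);
        return sent == 0 ? 0.0 : (double) inTransactions / sent;
    }

    public static void main(String[] args) {
        String line = "INFO [plugin-transaction] numberStartEvents:111, numberEndEvents:111, "
                + "numberSendEvents:111, numberEventsInTransactions:39942";
        Map<String, Long> stats = parse(line);
        System.out.println(stats);
        System.out.println(String.format(Locale.ROOT,
                "plugin events per cache clear: %.1f", eventsPerCacheClear(stats)));
    }
}
```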

            DEBUG

            com.atlassian.jira.plugin.PluginTransactionListener - at DEBUG level, this prints statistics for every plugin event transaction

            DEBUG - [plugin-transaction] transaction starts with event:PluginTransactionStartEvent
            DEBUG [plugin-transaction] transaction ends with event:PluginTransactionEndEvent, numberEventsInTransaction:6007, firstEvent: PluginDisablingEvent for plugin-test, lastEvent: PluginDisabledEvent for plugin-test

            Versions

            This fix will be available in the Jira 8.2.0 release. Soon after the 8.2 release, we are planning to backport it to 7.13.x.

             

            Maciej Swinarski (Inactive) added the comment above (edited).

            Atlassian...please note that you don't even need a high load on Jira to make this happen. 

            It's disappointing that a fix for a bug classified as "highest priority, critical severity, 50% occurrence factor" didn't make it into Jira 8.0.

            Michael Evans added the comment above.

              mswinarski Maciej Swinarski (Inactive)
              ayakovlev@atlassian.com Andriy Yakovlev [Atlassian]
              Affected customers: 82
              Watchers: 132