  Jira Data Center / JRASERVER-64908

UPM actions may flush internal caches, leading to performance problems

    • Severity 1 - Critical
      Atlassian Update – 26 April 2019

      Dear Jira users,

      we’re glad to announce that this issue will be addressed in our upcoming 8.2.0 release!

      We’ve fixed several issues that caused performance degradation in large Jira instances when installing or updating apps. If you’ve encountered this problem before, it’s solved, so feel free to update your apps anytime. For technical details, see the fix description comment below.

      We look forward to your comments.

      Thank you,
      Grazyna Kaszkur

      Product Manager,
      Jira Server and Data Center


      Summary

      During plugin (add-on) installs, updates (including license updates), deletes, or disables, UPM flushes a large number of caches. Under high load this causes significant JIRA performance degradation and might cause JIRA to freeze or stall for a long time.

      Steps to Reproduce

      1. Apply high load to JIRA
      2. Upgrade any plugin.

      Expected Results

      • JIRA stays responsive; at most, there is a slight performance degradation.
      • JIRA admin is notified about the impact

      Actual Results

      • JIRA admin is not notified about the impact
      • A thread dump might show the UpmAsynchronousTaskManager thread waiting while it resets caches:
        "UpmAsynchronousTaskManager:thread-1" #38891 prio=5 os_prio=0 tid=0x00007fbda8047800 nid=0x6712 waiting on condition [0x00007fbcf9cd5000]
           java.lang.Thread.State: WAITING (parking)
        	at sun.misc.Unsafe.park(Native Method)
        	- parking to wait for  <0x000000056c734418> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
        ...
        	at net.sf.ehcache.concurrent.ReadWriteLockSync.lock(ReadWriteLockSync.java:50)
        	at com.atlassian.cache.ehcache.LoadingCache.remove(LoadingCache.java:190)
        	at com.atlassian.cache.ehcache.DelegatingCachedReference.reset(DelegatingCachedReference.java:79)
        	at com.atlassian.jira.issue.search.managers.DefaultSearchHandlerManager.refresh(DefaultSearchHandlerManager.java:290)
        	at com.atlassian.jira.issue.search.managers.DefaultIssueSearcherManager.refresh(DefaultIssueSearcherManager.java:46)
        	at com.atlassian.jira.issue.fields.DefaultFieldManager.refreshSearchersAndIndexers(DefaultFieldManager.java:804)
        	at com.atlassian.jira.issue.fields.DefaultFieldManager.refresh(DefaultFieldManager.java:706)
        	at com.atlassian.jira.plugin.JiraCacheResetter$Delegate.resetCaches(JiraCacheResetter.java:83)
        	at com.atlassian.jira.plugin.JiraCacheResetter$Delegate.onPluginModuleDisabled(JiraCacheResetter.java:53)
        ...
        	at com.atlassian.plugin.event.impl.DefaultPluginEventManager.broadcast(DefaultPluginEventManager.java:73)
        	at com.atlassian.plugin.manager.DefaultPluginManager.broadcastIgnoreError(DefaultPluginManager.java:2081)
        	at com.atlassian.plugin.manager.DefaultPluginManager.publishModuleDisabledEvents(DefaultPluginManager.java:1955)
        	at com.atlassian.plugin.manager.DefaultPluginManager.disablePluginModuleNoPersist(DefaultPluginManager.java:1910)
        	at com.atlassian.plugin.manager.DefaultPluginManager.disablePluginModules(DefaultPluginManager.java:1904)
        	at com.atlassian.plugin.manager.DefaultPluginManager.disablePluginWithModuleEvents(DefaultPluginManager.java:1860)
        	at com.atlassian.plugin.manager.DefaultPluginManager.notifyPluginDisabled(DefaultPluginManager.java:1888)
        	at com.atlassian.plugin.manager.DefaultPluginManager.unloadPlugin(DefaultPluginManager.java:1022)
        	at com.atlassian.plugin.manager.DefaultPluginManager.uninstallNoEvent(DefaultPluginManager.java:981)
        	at com.atlassian.plugin.manager.DefaultPluginManager.updatePlugin(DefaultPluginManager.java:1343)
        	at com.atlassian.plugin.manager.DefaultPluginManager.addPlugins(DefaultPluginManager.java:1133)
        	at com.atlassian.jira.plugin.JiraPluginManager.addPlugins(JiraPluginManager.java:150)
        	at com.atlassian.plugin.manager.DefaultPluginManager.scanForNewPlugins(DefaultPluginManager.java:903)
        	at com.atlassian.plugin.manager.DefaultPluginManager.installPlugins(DefaultPluginManager.java:821)
        	at com.atlassian.jira.plugin.JiraPluginManager.installPlugins(JiraPluginManager.java:168)
        ...
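The top frames of the dump show the UPM thread parked on a ReentrantReadWriteLock while DelegatingCachedReference.reset removes a cache entry. As a rough illustration of why a reset can stall under load, the Java sketch below uses only java.util.concurrent and hypothetical class names (GuardedCachedValue, LockStallDemo); it is not Jira or Ehcache code. It assumes, as the dump suggests, that a reset needs the exclusive write lock and therefore has to wait for in-flight readers to finish.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative model only (hypothetical names): a cached value guarded by a read-write lock,
// loosely mirroring the LoadingCache / ReadWriteLockSync frames in the dump. Not Atlassian code.
class GuardedCachedValue {
    // The no-arg constructor creates a non-fair lock, matching the NonfairSync in the dump.
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private String value = "loaded";

    // A user request holds the read lock while it works with the cached value.
    String read(long holdMillis, CountDownLatch acquired) throws InterruptedException {
        lock.readLock().lock();
        try {
            acquired.countDown();     // signal that the read lock is now held
            Thread.sleep(holdMillis); // simulate a slow request
            return value;
        } finally {
            lock.readLock().unlock();
        }
    }

    // A cache reset needs the exclusive write lock, so it parks until every reader has left.
    long resetAndMeasureWaitMillis() {
        long start = System.nanoTime();
        lock.writeLock().lock();
        try {
            value = null;
        } finally {
            lock.writeLock().unlock();
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }
}

class LockStallDemo {
    // Starts one slow reader, then measures how long a reset has to wait for it.
    static long measureResetWait(long readerHoldMillis) throws InterruptedException {
        GuardedCachedValue cache = new GuardedCachedValue();
        CountDownLatch acquired = new CountDownLatch(1);
        Thread reader = new Thread(() -> {
            try {
                cache.read(readerHoldMillis, acquired);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        reader.start();
        acquired.await(); // ensure the reader holds the lock before we try to reset
        long waited = cache.resetAndMeasureWaitMillis();
        reader.join();
        return waited;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("reset waited about " + measureResetWait(200) + " ms");
    }
}
```

With many concurrent requests repeatedly taking the read lock, each reset can wait far longer than in this single-reader example.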
        

      Notes

      The problem is amplified on JIRA Data Center: a cache flush is also triggered during node startup, because UPM needs to register and load modules from plugins. See JRASERVER-66839.

      Workaround

      Plan any plugin actions (update, delete, disable; i.e. UPM actions) for off-peak hours or maintenance windows. See Best Practices for Managing JIRA Application Add-ons for recommendations.

        Attachments:
        1. plugin_disable_80.png (341 kB)
        2. plugin_disable_82.png (133 kB)


            Hi,

            Thank you for the recent comments.

            We want to inform you that we have decided not to backport this fix to the 7.13 Enterprise Release version. While we’ve been working to backport this fix in recent months, after careful analysis, we ultimately feel that the change would bring too much risk to the stability of the release.

            We have made this tough call because the changes introduced in 8.2 were complex and scattered across many libraries and codebases. Additionally, backporting this particular change over the platform release involved many moving parts. All this made us realise that the risk this change carries is far too severe, and that we might cause even more problems if we went forward with this backport.

            We are aware that you were waiting for this update and that our decision may be disappointing. We want to reassure you that we will keep backporting fixes for crucial bugs to our long-term support versions, as stated here: https://confluence.atlassian.com/enterprise/atlassian-enterprise-releases-948227420.html. It is simply that, in some cases, a backport brings risks we hadn't originally anticipated that outweigh its benefits.

            If you are impacted by this bug, we recommend upgrading to our newest Enterprise release, version 8.5.x. Please follow our upgrade guide for Jira 8.5: https://confluence.atlassian.com/jiracore/jira-8-5-enterprise-release-upgrade-guide-976161392.html.

            Best,

            Jira Server and Data Center product management

            Comment by Grazyna Kaszkur

            ivan.balandzin,
            thanks for asking. We're currently testing this backport. It's not a trivial fix, so unfortunately it takes time.

            Comment by Kamil Cichy (Inactive)

            Hello gkaszkur,
            Any news on the backport of the fix to 7.13.* since November? A month has passed...

            Comment by Ivan Balandzin

            Hi richard.carini,

            thank you for your question. I can confirm that the team is working on backporting the fix to 7.13.

            Unfortunately, I cannot provide an exact date or fix version for when it will be available, because it’s a considerable chunk of work that touches multiple projects. Moreover, the 7.13 line is an Enterprise Release, so we need to take extra care that the changes don’t cause any regressions, and testing will have to be more rigorous.

            Please stay tuned and watch this ticket for further updates. 

            Thank you, 

            Grażyna

            Comment by Grazyna Kaszkur

            Hi gkaszkur,

            Any news regarding which 7.13.x ER this is getting into? I need to schedule a 7.13 update and would like to wait for this fix.

            Thanks again!
            Rick

            Comment by Rick Carini

            Hey baldin1697836153
            I wanted to confirm that we are aiming to backport this fix to the 7.13 Enterprise Release; however, we do not have a specific 7.13 ER version yet, because we have decided to extend the soaking period for this fix as a stability precaution before the backport.
            Please expect another update from us around October or early November 2019.

            Thank you.

            Comment by Grazyna Kaszkur

            Has this been backported to a 7.13 ER?

            Comment by Dom Baldin [Adobe]

            Will you also be backporting to the other Enterprise Release, 7.6.x?

            Is there a similar issue for Confluence, or does the difference in how caching is done in each product mean that this is a non-issue there?

            Thanks!

            Comment by Rick Carini

            Fix description

            Problem

            There is a pattern across Jira where caches are cleared when there is a change in the plugin system. A single user action on the plugin system (installing, enabling, disabling, or updating a plugin) could trigger thousands of plugin events, and each could result in clearing some caches. Combined with incoming user traffic that was triggering cache loads, this seriously affected both the duration of the plugin action and user request times. There was also a problem with plugins being disabled in an incorrect order, which, because of plugin dependencies and plugins providing content to extension points, could seriously affect page load time. In extreme situations, a plugin action affecting many modules could affect the stability of Jira for a long time.
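A minimal Java sketch of the per-event pattern described above, assuming a listener that clears a cache on every plugin event. PerEventCacheResetter and onPluginModuleEvent are hypothetical names, not Jira's JiraCacheResetter API; the figure of 6007 events is borrowed from the DEBUG log example later in this comment.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of the pre-8.2 pattern: a listener clears caches on every plugin event.
// Names are illustrative, not Jira's actual API.
class PerEventCacheResetter {
    final ConcurrentHashMap<String, String> searcherCache = new ConcurrentHashMap<>();
    final AtomicInteger cacheClears = new AtomicInteger();

    // Invoked once per plugin event, e.g. once for every module being disabled.
    void onPluginModuleEvent(String moduleKey) {
        searcherCache.clear();         // the cache is discarded on every event...
        cacheClears.incrementAndGet(); // ...so concurrent requests keep reloading it
    }

    // One admin action (disabling a plugin) fans out into one event per module.
    static int disablePlugin(PerEventCacheResetter listener, int moduleEvents) {
        for (int i = 0; i < moduleEvents; i++) {
            listener.onPluginModuleEvent("module-" + i);
        }
        return listener.cacheClears.get();
    }

    public static void main(String[] args) {
        int clears = disablePlugin(new PerEventCacheResetter(), 6007);
        System.out.println(clears + " cache clears for a single admin action");
    }
}
```

In this model the number of cache clears grows linearly with the number of plugin events, which is what makes a single admin action so expensive under user load.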

            Jira 8.0: example of disabling a multi-module plugin with dependencies on Jira Server (1M issues, 400 concurrent users).

            Solution

            1. We have limited the number of plugin events that are used for clearing caches. A user action on the plugin system, potentially triggering thousands of plugin events, is wrapped into a "transaction", and caches are now cleared based on just a few transactions triggered by this action.
            2. In Jira DC, the new transaction event that triggers cache clearing always runs with cache replication turned off, so caches are not unnecessarily cleared multiple times on each node.
            3. We have fixed the order of shutting down plugins so that it respects dependencies.
            4. There is now a plugin shutdown timeout (500 ms), which is required when shutting down plugins with cyclic dependencies.
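The "transaction" idea in point 1 can be sketched as follows. This is an illustrative model under stated assumptions, not the code shipped in 8.2: PluginTransactionSketch, begin(), end(), and the nesting counter are hypothetical, and the two counters merely echo the numberEventsInTransactions and numberSendEvents fields from the log statistics in this comment.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of batching plugin events into a "transaction". Names are illustrative.
class PluginTransactionSketch {
    final ConcurrentHashMap<String, String> searcherCache = new ConcurrentHashMap<>();
    final AtomicInteger eventsInTransactions = new AtomicInteger(); // cf. numberEventsInTransactions
    final AtomicInteger cacheClears = new AtomicInteger();          // cf. numberSendEvents
    private int depth = 0; // nesting level (assumption: transactions may nest)

    synchronized void begin() {
        depth++;
    }

    // Inside a transaction, events are only counted; no cache work happens yet.
    synchronized void onPluginModuleEvent(String moduleKey) {
        if (depth > 0) {
            eventsInTransactions.incrementAndGet();
        } else {
            clearLocalCaches(); // events outside any transaction behave as before
        }
    }

    // Closing the outermost transaction triggers exactly one cache clear.
    synchronized void end() {
        if (--depth == 0) {
            clearLocalCaches();
        }
    }

    // Assumption mirroring point 2: in Data Center this clear is node-local (replication off),
    // so each node clears its own caches once instead of receiving replicated invalidations.
    private void clearLocalCaches() {
        searcherCache.clear();
        cacheClears.incrementAndGet();
    }

    // One admin action: open a transaction, emit one event per module, close it.
    static PluginTransactionSketch disablePlugin(int moduleEvents) {
        PluginTransactionSketch tx = new PluginTransactionSketch();
        tx.begin();
        try {
            for (int i = 0; i < moduleEvents; i++) {
                tx.onPluginModuleEvent("module-" + i);
            }
        } finally {
            tx.end(); // always close, so the cache clear happens even if an event handler fails
        }
        return tx;
    }

    public static void main(String[] args) {
        PluginTransactionSketch tx = disablePlugin(6007);
        System.out.println(tx.eventsInTransactions.get() + " events, " + tx.cacheClears.get() + " cache clear(s)");
    }
}
```

The design choice being illustrated: the cost of clearing caches is paid once per user action rather than once per plugin event.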

            With these fixes, the same long-running operations from before now have very little effect on response times, and the time to perform plugin operations is much shorter.
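Points 3 and 4 amount to a dependency-ordered shutdown with a bounded wait for cycles. A hedged Java sketch, assuming plugin dependencies are known as a simple map: ShutdownOrder, order(), and stopWithTimeout() are hypothetical names, and the ordering uses Kahn's topological sort, which may differ from the technique Jira actually uses. Only the 500 ms value comes from the text above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch, not Atlassian's implementation: stop every plugin before the
// plugins it depends on, and bound the wait when no safe order exists.
class ShutdownOrder {
    // dependsOn maps each plugin key to the keys of the plugins it requires.
    static List<String> order(Map<String, Set<String>> dependsOn) {
        // For each plugin, count how many not-yet-stopped plugins still depend on it.
        Map<String, Integer> dependents = new TreeMap<>(); // TreeMap keeps the output deterministic
        for (String plugin : dependsOn.keySet()) {
            dependents.putIfAbsent(plugin, 0);
        }
        for (Set<String> required : dependsOn.values()) {
            for (String dep : required) {
                dependents.merge(dep, 1, Integer::sum);
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        dependents.forEach((plugin, count) -> { if (count == 0) ready.add(plugin); });

        List<String> result = new ArrayList<>();
        while (!ready.isEmpty()) {
            String plugin = ready.poll();
            result.add(plugin); // nothing still running depends on it, so it is safe to stop
            for (String dep : dependsOn.getOrDefault(plugin, Set.of())) {
                if (dependents.merge(dep, -1, Integer::sum) == 0) {
                    ready.add(dep);
                }
            }
        }
        // Whatever is left is part of a dependency cycle: no safe order exists, so these
        // plugins are appended and should be stopped with a bounded wait (stopWithTimeout).
        for (String plugin : dependents.keySet()) {
            if (!result.contains(plugin)) {
                result.add(plugin);
            }
        }
        return result;
    }

    // Runs a plugin's stop logic but gives up waiting after timeoutMillis (500 ms in the fix).
    static boolean stopWithTimeout(Runnable stop, long timeoutMillis) throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<?> done = executor.submit(stop);
            done.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | ExecutionException e) {
            return false; // timed out or failed: move on instead of blocking the whole shutdown
        } finally {
            executor.shutdownNow(); // interrupts the stop task if it is still running
        }
    }
}
```

For example, with app depending on lib, and a and b depending on each other, order() returns [app, lib, a, b]: app is stopped before the lib it needs, and the cyclic pair is left for the timeout to handle.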

            Jira 8.2: example of disabling a multi-module plugin with dependencies on Jira Server (1M issues, 400 concurrent users).

            New logs

            INFO

            INFO [plugin-transaction] numberStartEvents:111, numberEndEvents:111, numberSendEvents:111, numberEventsInTransactions:39942 

            This is logged every 5 minutes and shows statistics related to the new plugin transaction event.

            • numberSendEvents - can be read as the number of triggered cache clears
            • numberEventsInTransactions - can be read as the number of plugin events that were part of transactions (i.e. events that, before the fix, would each have triggered cache clears)

            In this example, since Jira started, cache-clearing events were triggered on this instance 111 times instead of 39,942.
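To turn the periodic INFO line into the comparison above, a small parser is enough. This Java sketch assumes the key:value format of the sample line shown here; PluginTransactionStats is a hypothetical helper, and other Jira versions may log differently.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: parses the periodic "[plugin-transaction]" INFO line into counters.
// The format is assumed from the sample log line; it is not a documented contract.
class PluginTransactionStats {
    private static final Pattern COUNTER = Pattern.compile("(\\w+):(\\d+)");

    static Map<String, Long> parse(String logLine) {
        Map<String, Long> counters = new HashMap<>();
        Matcher m = COUNTER.matcher(logLine);
        while (m.find()) {
            counters.put(m.group(1), Long.parseLong(m.group(2)));
        }
        return counters;
    }

    // Average number of plugin events folded into each cache clear.
    static double eventsPerCacheClear(Map<String, Long> counters) {
        long sends = counters.getOrDefault("numberSendEvents", 0L);
        long events = counters.getOrDefault("numberEventsInTransactions", 0L);
        return sends == 0 ? 0.0 : (double) events / sends;
    }

    public static void main(String[] args) {
        String line = "INFO [plugin-transaction] numberStartEvents:111, numberEndEvents:111, "
                + "numberSendEvents:111, numberEventsInTransactions:39942";
        Map<String, Long> c = parse(line);
        System.out.printf("%d cache clears instead of %d (%.1f events per clear)%n",
                c.get("numberSendEvents"), c.get("numberEventsInTransactions"), eventsPerCacheClear(c));
    }
}
```

With the sample line, eventsPerCacheClear returns 39942 / 111, i.e. roughly 360 plugin events folded into each cache clear.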

            DEBUG

            com.atlassian.jira.plugin.PluginTransactionListener - at DEBUG level, this logger prints statistics for every plugin event transaction:

            DEBUG - [plugin-transaction] transaction starts with event:PluginTransactionStartEvent
            DEBUG [plugin-transaction] transaction ends with event:PluginTransactionEndEvent, numberEventsInTransaction:6007, firstEvent: PluginDisablingEvent for plugin-test, lastEvent: PluginDisabledEvent for plugin-test

            Versions

            This fix will be available in the Jira 8.2.0 release. Soon after the 8.2 release, we plan to backport it to 7.13.x.

            Comment by Maciej Swinarski (Inactive), edited

            Atlassian, please note that you don't even need high load on Jira to make this happen.

            It's disappointing that a fix for a bug classified as "highest priority, critical severity, 50% occurrence factor" didn't make it in to Jira 8.0. 

            Comment by Michael Evans

            Because of this bug, we have outages each time we update plugins. I think the Data Center version should be more stable, but because of this bug it is not. This sounds like a serious bug.

            It is very important. Please fix it.

            Comment by Michal Zwierzchowski

            Jira 7.7.2 Server is affected as well; we would be glad to have it fixed.

            Kind Regards, 

            Oksana Andreis [Valiantys]

            Comment by Valiantys Support

            Jira 7.2.9 becomes completely unresponsive during add-on updates and sometimes even needs to be restarted. Updates for even simple add-ons that are not really in use take a long time and make the system unresponsive.

            We cannot even see the standard dashboard. It is completely off.

            Comment by freitel_mobile

            Jira 7.5.0 server ...

            Comment by Soporte Ferrovial

            On the fieldlayoutitem table concern, you may be interested in https://jira.atlassian.com/browse/JRASERVER-29310

            Comment by Zul NS [Atlassian]

            S Stack added a comment -

            JIRA 7.2.8 Data Center

            Impact on (Postgres) DB

            During a single plugin update (JIRA Portfolio), Postgres spent 760 seconds processing the fieldlayoutitem table. For about 5-7 minutes, JIRA was almost completely unresponsive.


            To be honest, the workarounds listed here do not apply. It does not matter at what time of day or under what load we make the changes. When we upgrade UPM or almost any plugin through the UI, our system is completely unavailable while the plugin system upgrades the plugin or set of plugins. This usually lasts for about 20-30 minutes.

            Our only option to not "disrupt" users is to take the application offline, put the upgraded plugins into the filesystem, and then start the application back up. But this is also a 20-30 minute outage, due to PSMQ delaying startup by over 15 minutes in Data Center. JSDSERVER-5219

            Comment by Micah Figone

            We are on JIRA Data Center 7.3.4, and every time we upgrade the UPM plugin (which should be as simple as clicking the Update button, as Atlassian itself suggests), our whole JIRA instance stalls, performance degrades badly (we observe high CPU usage), and the system becomes practically unusable.

            Categorizing this issue as "low priority" by Atlassian is kind of insulting to be honest.

            Comment by Tomas Arguinzones Yahoo

            S Stack added a comment -

            JIRA 7.2.8 Data Center

            We find we need to monitor our DB host very carefully after each plugin update (especially the first update which is the most resource-intensive). After about 10 minutes the DB host quiesces and then we can update the next plugin.

            Note that we choose off-hours to update plugins and still see very high resource consumption.

            This is a poor experience for administrators.

            Micah Figone added a comment - - edited

            @matt Yeah.... Well.... Sometimes you have to go backwards to go forwards....


            MattS added a comment -

            @micah your workaround is the same as using JIRA 3.x


            MattS added a comment - - edited

            One of my large enterprise customers had a 30 minute period of JIRA being unavailable due to this. Not a Low priority, surely


            Agree with Stanton. As an example, we upgraded the Service Desk plugin on our dev instance, and it took over 20 minutes for the upgrade to finish. And that is with a system that has no load. Imagine if we did this on prod.

            Workaround: don't upgrade through UPM. Download the update, install it in the plugins directory, and restart Jira.

            Comment by Micah Figone

            This should be much higher priority than "low". There is no such thing as an idle time for a JIRA instance used around the world, so there is no time that we can guess is "safe" for upgrading a plugin. Anything like this that can bring down JIRA from a simple maintenance action is a very serious bug.

            Comment by Stanton Stevens

              Maciej Swinarski (Inactive)
              Andriy Yakovlev [Atlassian]
              Affected customers: 82
              Watchers: 132