Jira Service Management Data Center / JSDSERVER-5736

Poor performance with high CPU and a high number of SdOffThreadEventJobRunner threads

    • Type: Bug
    • Resolution: Fixed
    • Priority: Highest
    • Fix Version/s: 4.4.0
    • Affects Version/s (20): 3.9.0, 3.9.1, 3.9.2, 3.9.3, 3.9.4, 3.9.6, 3.9.7, 3.9.8, 3.9.9, 3.9.10, 3.9.11, 3.10.0, 3.10.1, 3.10.2, 3.11.0, 3.11.1, 3.16.0, 3.16.1, 4.0.0, 4.1.0, 4.2.0
    • Component/s: SLA

       

      Atlassian Update – 10 September 2021

      Hi everyone,

      Thank you for your feedback on the ticket and supporting our team in our investigation!

      After analysing the problem, we have identified that the issue of "Poor performance with high CPU and a high number of SdOffThreadEventJobRunner threads" has been fixed in JSD 4.4.0. This problem occurred because these threads were unbounded before 4.4.0, which resulted in a high DB load on the instance.

      However, we have identified two other issues with the JSM async processing logic which we need to resolve, related to the problems reported in this bug ticket. These issues are tracked in their respective tickets:

      1. JSDSERVER-5730: related to threads deadlocking when there are frequent actions on one issue. A fix for this is released behind a dark feature in 4.9.0, and the development team will be working on enabling it by default in future. More details on the fix: https://confluence.atlassian.com/jirakb/deadlocking-in-jira-service-desk-when-frequently-updating-the-same-issue-979428323.html

      2. JSDSERVER-8635: related to threads deadlocking when processing for one issue takes over 5 minutes. This ticket is currently gathering impact.

      If you have any further concerns with the above, please open a support ticket via https://support.atlassian.com

      Thank you,

      Alex

      Description 

      JSD 3.9.0 attempts to address some of the friction between the SLA system and automation (JSDSERVER-4743) and poor issue creation performance by introducing a wrapper event type (inspired by OnCommitEvent) and an “expectation” system.

      The expectation system gives features interested in one or more eligible event types a way to explicitly define the work that should be done before a wrapped event is dispatched. They do this by submitting "jobs" that are executed in strict cluster-wide order of submission (no more than one job at a time for each issue) on a thread pool, so that request threads are never blocked; if the submitting thread is not a request thread, the job simply runs on it.
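      To make the ordering rule concrete, here is a minimal single-node sketch of per-issue serial job execution. All class and method names here are hypothetical, and the real system uses a cluster-wide, database-backed queue (PSMQ) rather than an in-JVM map; this only illustrates the "one job at a time per issue, in submission order" contract.

      import java.util.Map;
      import java.util.concurrent.CompletableFuture;
      import java.util.concurrent.ConcurrentHashMap;
      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;

      // Hypothetical sketch: at most one job runs at a time per issue, and jobs
      // for the same issue run in submission order, off the request thread.
      class PerIssueJobDispatcher {
          // Bounded pool; the pre-4.4.0 behaviour described in this ticket was effectively unbounded.
          private final ExecutorService pool = Executors.newFixedThreadPool(8);
          private final Map<Long, CompletableFuture<Void>> tails = new ConcurrentHashMap<>();

          void submit(long issueId, Runnable job, boolean onRequestThread) {
              if (!onRequestThread) {
                  job.run(); // non-request threads just execute the job inline
                  return;
              }
              // Chain the new job after the previous job for the same issue,
              // preserving submission order. (A production version would also
              // prune completed entries from the map.)
              tails.compute(issueId, (id, tail) ->
                      (tail == null ? CompletableFuture.<Void>completedFuture(null) : tail)
                              .thenRunAsync(job, pool));
          }
      }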

      The wrapper event type does the same for the work that should be done after what we refer to as "completion", via @EventListener methods that take a ServiceDeskWrappedOnCompletionEvent parameter and return void, as in the example below.
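      For illustration, such a listener might look like the following. @EventListener is Atlassian's standard event annotation; the listener class and method names are made up, and the event class's package is internal to Service Desk, so no import is shown for it.

      import com.atlassian.event.api.EventListener;

      // Hypothetical listener: runs after "completion", i.e. after all
      // expectation jobs for the wrapped event have finished.
      public class ExampleCompletionListener {
          @EventListener
          public void onWrappedCompletion(ServiceDeskWrappedOnCompletionEvent event) {
              // e.g. recalculate SLA state for the affected issue (illustrative only)
          }
      }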

      At least two recent support cases have involved severe performance degradation of a node and/or the database for an instance, seemingly caused or exacerbated by the expectation system, so we'll link potential causes to this issue as we find them.

      Diagnosis

      • High CPU usage on DB server
      • Increased number of threads used by the Jira process
      • High number of SdOffThreadEventJobRunner threads in thread dumps, many of them connecting to the database (see the thread-count sketch below)
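      For the third symptom, a quick count can be taken from any JVM-level script console on the affected node. This is a diagnostic sketch, not an official Atlassian procedure; it only assumes the thread-name prefix reported in this ticket.

      // Count SdOffThreadEventJobRunner threads in the running JVM.
      long count = Thread.getAllStackTraces().keySet().stream()
              .filter(t -> t.getName().startsWith("SdOffThreadEventJobRunner"))
              .count();
      System.out.println("SdOffThreadEventJobRunner threads: " + count);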

      Possible workaround (JSD 3.9+)

      These steps affect the expectation system such that jobs are always executed immediately on the submitting thread, without touching any OffThreadEventJobRunner or PSMQ code paths, as if the submitting threads were never request threads (JSDSERVER-5730). A sketch of how such a flag gates the dispatch appears after these steps.

      1. Go to the dark feature settings page (<baseURL>/secure/SiteDarkFeatures!default.jspa)
      2. Remove the feature flag sd.internal.base.off.thread.on.completion.events.enabled, if it exists
      3. Add the following feature flag: sd.internal.base.off.thread.on.completion.events.disabled
      4. Restart JIRA

      SLA accuracy shouldn't be negatively affected, but issue creation might take longer as a result. "WHEN issue created" automation rules with SLA-related JQL should still work (JSDSERVER-4743).
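      Conceptually, the flag makes the dispatcher skip the off-thread path entirely. Below is a sketch of that gating; FeatureManager#isEnabled is Jira's real dark-feature API, but the surrounding dispatch logic is guessed from this ticket's description rather than taken from the actual source.

      import com.atlassian.jira.config.FeatureManager;

      // Hypothetical sketch of how the dark feature flag gates dispatch.
      class CompletionEventDispatcher {
          private static final String FLAG = "sd.internal.base.off.thread.on.completion.events.disabled";
          private final FeatureManager featureManager;

          CompletionEventDispatcher(FeatureManager featureManager) {
              this.featureManager = featureManager;
          }

          void dispatch(Runnable job) {
              if (featureManager.isEnabled(FLAG)) {
                  job.run(); // workaround active: always run on the submitting thread
              } else {
                  submitOffThread(job); // normal path: OffThreadEventJobRunner / PSMQ
              }
          }

          private void submitOffThread(Runnable job) {
              // Real off-thread submission omitted; illustrative placeholder only.
          }
      }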

            [JSDSERVER-5736] Poor performance with high CPU and a high number of SdOffThreadEventJobRunner threads


            Stephan Vos added a comment -

            We are experiencing a crash that seems to be related to this. (Service Management Server 4.15)

            There is contention on the Queue and Message tables, which seems to cause either regular deadlocks or sometimes a MySQL crash.

            Aug 19 07:05:10 localhost mysqld[11332]: Some pointers may be invalid and cause the dump to abort.
            Aug 19 07:05:10 localhost mysqld[11332]: Query (0x7f34040107c0): update `AO_319474_MESSAGE` set `CLAIMANT` = null, `CLAIMANT_TIME` = null where `AO_319474_MESSAGE`.`QUEUE_ID` = 1214021 and `AO_319474_MESSAGE`.`CLAIMANT` is not null
            Aug 19 07:05:10 localhost mysqld[11332]: Connection ID (thread ID): 12046760

            Alex Cooksey added a comment (edited) -
            Atlassian Update – 30 June 2021

            Hi everyone,

            Thank you for your feedback regarding this bug and for bringing it to our attention. Since this bug was re-opened in May, we have resumed our investigation into this issue.

            Currently we're unable to reproduce this issue and are looking to work with customers directly affected who are able to reproduce the problem.

            If you're able to reproduce the issue, please open a support ticket via https://support.atlassian.com and let the support engineer know about this ticket and this message. We'll work with you to make sure the development team has the information they need to begin working on resolving it.

            Please let us know if you have any concerns with the above.

            Thank you,

            Alex


            Gonchik Tsymzhitov added a comment -

            Hi Denise,
            Thank you!
            Sorry for bothering you.

            Andrea Hakim added a comment -

            Issue occurring in JSD 4.15

            Denise Unterwurzacher [Atlassian] (Inactive) added a comment -

            To be clear, I am not on the Jira Development team, so I can't update the fix version, and I can't fix this problem myself. We are experiencing this on an internal Jira that my team owns, so I have reopened it for the Jira team to re-triage.

            Gonchik Tsymzhitov added a comment -

            dunterwurzacher Could you clear the fix version please?

            Gonchik Tsymzhitov added a comment -

            dunterwurzacher Thank you!

            Denise Unterwurzacher [Atlassian] (Inactive) added a comment -

            Reopening this for the Jira team to triage again, as it seems to still be occurring for a lot of folks in later versions.

            Kevin Dalton added a comment -

            Curious what the workaround would do for those not running Service Desk. Is there still a benefit of adding sd.internal.base.off.thread.on.completion.events.disabled?

            Gonchik Tsymzhitov added a comment -

            Strange to see this in the logs of JSD DC on 4.13.3, and the flag has been removed. Maybe there is another cause?

            kevin.lacinski added a comment -

            Is there any fix for 3.16.1? Or is the only option to upgrade past the 4.3.0 version of JSD? What is the exact action causing this? SLA processing? Reporting? I'm trying to understand what the trigger is, as it comes out of nowhere.

            Miguel Ángel Pérez Montero added a comment -

            Doesn't the JSD Enterprise release (3.16.x) have this fix?? My customer needs this fix for their version (3.16.x).

            Matthew Dell added a comment -

            Does anyone who has upgraded to the fix version have feedback on performance post-upgrade? I'm waiting for the enterprise release, but am curious about others' experience.

            Mohil Chandra added a comment -

            Released on 22nd July with JSD 4.3.0.

            Sergio C Silva added a comment -

            Version 4.3.0 was released on July 21st; has the bug been fixed?

            David Yu added a comment -

            Is there any particular kind of activity that would trigger this bug? And is it only constrained to activity within a Service Desk project?


            Marius Dinca added a comment -

            Unfortunately I'm quite sure they will not fix it anytime soon; if they haven't done it when they changed major versions and overhauled a lot of the inner workings, they won't do it now. Most likely the problem is much too complicated to solve, and while the system still works, why change it?

            I hope I'm wrong, as we see the same issue in our logs every day, and we have a very, very small instance of 6 agents and at most 30 customers on the platform at any one time.

            Harold Wong added a comment -

            Need a fix too

            JWCho added a comment (edited) -

            Still occurring on JSD 4.2.1, need to get this bug fixed soon!

            Marco Augusto Santinho Gonçalves added a comment -

            Hello all,

            We need a solution for this bug with a certain urgency. Please look at us, Atlassian!

            Juergen Lanner added a comment -

            Luca, the issue you describe looks like JRASERVER-63002, which should be solved in 8.0.0, but we're still seeing it in 8.0.2 as well...

            Luca Tanieli added a comment (edited) -

            Still occurring in JSD 3.16.1, even with the suggested workarounds, and it slows Jira response time considerably. Users and DBAs are not happy.

            The following query is using all the CPU on the DB server:

            SELECT CG.ID, CG.issueid, CG.AUTHOR, CG.CREATED, CI.ID, CI.groupid, CI.FIELDTYPE, CI.FIELD, CI.OLDVALUE, CI.OLDSTRING, CI.NEWVALUE, CI.NEWSTRING FROM jiraschema.changegroup CG INNER JOIN jiraschema.changeitem CI ON CG.ID = CI.groupid WHERE CG.issueid=@P0 AND CI.FIELD=@P1 ORDER BY CG.CREATED ASC, CI.ID ASC

            Sergio C Silva added a comment -

            We are an Atlassian partner in Brazil, and we have a big customer that is being terribly impacted by this bug. After this error, the customer is raising questions about the viability of continuing to use JIRA Service Desk. We need a solution urgently.

            Justin Nielsen added a comment -

            Still a thing in Jira SD 4.1.0

            huw added a comment -

            Seeing this in 3.15.3. 


            Ifteqar Ahmed added a comment -

            Still occurring in Jira SD 4.1.0 and Software 8.1.0

            Michael Lasonde added a comment -

            Still occurring in JSD 3.14.2

            Rich Wilkins added a comment -

            Still occurring in 4.1

            Tony Montana added a comment -

            Why is this not "In Progress"? WTF Atlassian?

            Tony Montana added a comment -

            Still occurring in Jira 8.0.2 / JSD 4.0.2

            NIT added a comment -

            Still present in Jira 7.13.1/JSD 3.16.1


            Sergio C Silva added a comment -

            This is a critical error!!!!

            Ala Ghoreishi added a comment -

            Still present in Jira 7.13.1/JSD 3.16.1

            llagos added a comment -

            Present still in Jira 7.12.3 / JSD 3.15.3


            proea added a comment -

            Getting this error on JIRA Service Desk 4.0.0 / JIRA Software 8.0.0


            Greg Kauffman added a comment -

            Getting this error on JSD 3.9.7 / JS 7.6.7

            Remedy Partners added a comment -

            Getting this error on JIRA Service Desk 3.12.0 / JS 7.9.0

            Deleted Account (Inactive) added a comment -

            We are experiencing this issue as well.

            Jira Software: 7.12.0 / JSD 3.15.0

            IT-BIB added a comment -

            Log scan results report this issue every day

            Jira Core 7.12.3 / JIRA Service Desk 3.15.3


            maccamlc added a comment -

            Have you considered using Kotlin and coroutines?


            Matt orchard added a comment -

            We are also experiencing this issue - daily. Is there an actual fix?

            Bryan Robison added a comment -

            Has there been any movement on this? My customer continues to experience this issue

            Bryan Robison added a comment -

            We had to re-enable ServiceDeskWrappedOnCompletionEvents this afternoon, because our users were experiencing extreme slowness when creating issues in the portal. Re-enabling the events has caused the DB connections to spike again, so we've had to resort to toggling the events on and off whenever we notice high DB connection usage.

            Bryan Robison added a comment (edited) -

            I've been experiencing this issue with a customer since we upgraded to JSD 3.10.2 at the end of February. I discovered this bug yesterday, and we definitely meet all of the criteria in the diagnosis section. We implemented the workaround above, and the Connection Pool graphs show the difference we've seen in connection usage after disabling the queue.

              Assignee: Mohil Chandra (mchandra@atlassian.com)
              Reporter: Delan Azabani (Inactive) (dazabani)
              Affected customers: 195
              Watchers: 188
