Jira Service Management Data Center
JSDSERVER-5736

Poor performance with high CPU and a high number of SdOffThreadEventJobRunner threads

    • Type: Bug
    • Resolution: Fixed
    • Priority: Highest
    • Fix Version: 4.4.0
    • Affects Versions: 3.9.0, 3.9.1, 3.9.2, 3.9.3, 3.9.4, 3.9.6, 3.9.7, 3.9.8, 3.9.9, 3.9.10, 3.9.11, 3.10.0, 3.10.1, 3.10.2, 3.11.0, 3.11.1, 3.16.0, 3.16.1, 4.0.0, 4.1.0, 4.2.0
    • Component: SLA

       

      Atlassian Update – 10 September 2021

      Hi everyone,

      Thank you for your feedback on the ticket and supporting our team in our investigation!

      After analysing the problem, we have identified that the issue "Poor performance with high CPU and a high number of SdOffThreadEventJobRunner threads" was fixed in JSD 4.4.0. The problem occurred because these threads were unbounded before 4.4.0, which resulted in high DB load on the instance.

      However, we have identified two other issues with the JSM async processing logic, related to the problems reported in this bug ticket, which we still need to resolve. These are tracked in their respective tickets:

      1. JSDSERVER-5730

      • This is related to threads deadlocking when there are frequent actions on one request. A fix for this is released behind a dark feature in 4.9.0, and the development team will be working on enabling it by default in future. More details on the fix: https://confluence.atlassian.com/jirakb/deadlocking-in-jira-service-desk-when-frequently-updating-the-same-issue-979428323.html

      2. JSDSERVER-8635

      • This is related to threads deadlocking when processing for one issue takes over 5 minutes. This ticket is currently gathering impact.

      If you have any further concerns with the above, please let us know by opening a support ticket via https://support.atlassian.com.

      Thank you,

      Alex

      Description 

      JSD 3.9.0 attempts to address some of the friction between the SLA system and automation (JSDSERVER-4743) and poor issue creation performance by introducing a wrapper event type (inspired by OnCommitEvent) and an “expectation” system.

      The expectation system gives features that are interested in one or more eligible event types a way to explicitly define the work that should be done before a wrapped event is dispatched. They do this by submitting "jobs" that are executed in strict cluster-wide order of submission (no more than one job at a time for each issue) on a thread pool, so that request threads are never blocked (if the submitting thread is not a request thread, the job simply runs on that thread).
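
      As a rough mental model of that per-issue ordering, here is a minimal single-node sketch in Java. The class, method and field names are ours and this is not the actual JSD implementation; in particular, the real system serializes jobs cluster-wide (via PSMQ) rather than per JVM, and cleans up finished work.

      import java.util.Map;
      import java.util.concurrent.CompletableFuture;
      import java.util.concurrent.ConcurrentHashMap;
      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;

      // Illustrative sketch only, not JSD code: jobs submitted for the same issue
      // run strictly one at a time, in submission order, on a bounded worker pool,
      // so the submitting (request) thread is never blocked.
      public class PerIssueJobRunner {

          // Bounded pool; the pre-4.4.0 behaviour described in this ticket was
          // effectively unbounded.
          private final ExecutorService pool = Executors.newFixedThreadPool(4);

          // Tail of the job chain for each issue; chaining preserves per-issue order.
          // (Completed tails are never removed here, which a real implementation
          // would have to handle.)
          private final Map<Long, CompletableFuture<Void>> tails = new ConcurrentHashMap<>();

          public void submit(long issueId, Runnable job) {
              tails.compute(issueId, (id, tail) ->
                      (tail == null ? CompletableFuture.<Void>completedFuture(null) : tail)
                              .thenRunAsync(job, pool));
          }
      }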

      The wrapper event type does the same for the work that should be done after what we refer to as "completion", by having features define @EventListener methods that take a single ServiceDeskWrappedOnCompletionEvent parameter and return void.
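
      For illustration, such a listener might look like the sketch below. Only the @EventListener annotation (from atlassian-event) and the event type name come from the description above; the listener class, method name and the stub interface are hypothetical.

      import com.atlassian.event.api.EventListener;

      // Stub for the JSD-internal event type named above; the real class ships
      // inside the Service Desk plugin.
      interface ServiceDeskWrappedOnCompletionEvent {}

      public class OnCompletionSlaListener {

          // Hypothetical listener: invoked only after "completion", i.e. after the
          // expectation jobs submitted for the wrapped event have finished.
          @EventListener
          public void onCompletion(ServiceDeskWrappedOnCompletionEvent event) {
              // e.g. SLA recalculation or other work that must see the finished state
          }
      }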

      At least two recent support cases have involved severe performance degradation of a node and/or of the database for an instance, which seems to have been caused or exacerbated by the expectation system, so we'll link potential causes to this issue as we find them.

      Diagnosis

      • High CPU usage on DB server
      • Increased number of threads used by the Jira process
      • High number of SdOffThreadEventJobRunner threads in thread dumps, connecting to the database (a quick in-process check is sketched below)
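
      As a rough cross-check of that last symptom, a snippet like the following counts live threads whose names contain the runner prefix. It has to run inside the Jira JVM (for example via a script console plugin); the class and method names are ours and are not part of Jira.

      import java.util.Map;

      public final class OffThreadRunnerThreadCheck {

          // Counts live threads in the current JVM whose name mentions the
          // SdOffThreadEventJobRunner prefix reported in this ticket's thread dumps.
          public static long countRunnerThreads() {
              Map<Thread, StackTraceElement[]> dump = Thread.getAllStackTraces();
              return dump.keySet().stream()
                      .filter(t -> t.getName().contains("SdOffThreadEventJobRunner"))
                      .count();
          }
      }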

      Possible workaround (JSD 3.9+)

      These steps configure the expectation system so that jobs are always executed immediately on the submitting thread, without touching any OffThreadEventJobRunner or PSMQ code paths, as if the submitting threads were never request threads (JSDSERVER-5730).

      1. Go to the dark feature settings page (<baseURL>/secure/SiteDarkFeatures!default.jspa)
      2. Remove the feature flag sd.internal.base.off.thread.on.completion.events.enabled, if it exists
      3. Add the following feature flag: sd.internal.base.off.thread.on.completion.events.disabled
      4. Restart JIRA

      SLA accuracy shouldn't be negatively affected, but issue creation might take longer as a result. "WHEN: Issue created" automation rules with SLA-related JQL should still work (JSDSERVER-4743).

      Comments

            Alex Cooksey added a comment: Atlassian Update – 10 September 2021 (see the update at the top of this ticket).

            Stephan Vos added a comment:

            We are experiencing a crash that seems to be related to this (Service Management Server 4.15).

            There is contention on the Queue and Message tables, which seems to cause either regular deadlocks or, sometimes, a MySQL crash.

            Aug 19 07:05:10 localhost mysqld[11332]: Some pointers may be invalid and cause the dump to abort.
            Aug 19 07:05:10 localhost mysqld[11332]: Query (0x7f34040107c0): update `AO_319474_MESSAGE` set `CLAIMANT` = null, `CLAIMANT_TIME` = null where `AO_319474_MESSAGE`.`QUEUE_ID` = 1214021 and `AO_319474_MESSAGE`.`CLAIMANT` is not null
            Aug 19 07:05:10 localhost mysqld[11332]: Connection ID (thread ID): 12046760

            Alex Cooksey added a comment (edited):
            Atlassian Update – 30 June 2021

            Hi everyone,

            Thank you for your feedback regarding this bug and for bringing it to our attention. Since this bug was re-opened in May, we have resumed our investigation into this issue.

            Currently we're unable to reproduce this issue and are looking to work with customers directly affected who are able to reproduce the problem.

            If you're able to reproduce the issue, please open a support ticket via https://support.atlassian.com and let the support engineer know about this ticket and this message. We'll work with you to make sure the development team has the information they need to begin working on resolving it.

            Please let us know if you have any concerns with the above.

            Thank you,

            Alex


            Gonchik Tsymzhitov added a comment:

            Hi Denise,
            Thank you!
            Sorry for bothering you.

            Andrea Hakim added a comment:

            Issue occurring in JSD 4.15.

            Denise Unterwurzacher [Atlassian] (Inactive) added a comment:

            To be clear, I am not on the Jira Development team, so I can't update the fix version, and I can't fix this problem myself. We are experiencing this on an internal Jira that my team owns, so I have reopened it for the Jira team to re-triage.

            Gonchik Tsymzhitov added a comment:

            dunterwurzacher Could you clear the fix version, please?

            Gonchik Tsymzhitov added a comment:

            dunterwurzacher Thank you!

            Denise Unterwurzacher [Atlassian] (Inactive) added a comment:

            Reopening this for the Jira team to triage again, as it seems to still be occurring for a lot of folks in later versions.

            Kevin Dalton added a comment:

            Curious what the workaround would do for those not running Service Desk. Is there still a benefit to adding sd.internal.base.off.thread.on.completion.events.disabled?

              Assignee: Mohil Chandra (mchandra@atlassian.com)
              Reporter: Delan Azabani (Inactive) (dazabani)
              Affected customers: 195
              Watchers: 188
