Uploaded image for project: 'Jira Data Center'
  1. Jira Data Center
  2. JRASERVER-71350

Batching notifications tables can lead to high CPU usage on MSSQL and cause Batched Notifications to be sent with a long delay

    XMLWordPrintable

Details

    • 8
    • 119
    • Severity 2 - Major
    • 213
    • Hide
      Atlassian Update – 23 February 2024

      Dear Customers,

      Thank you for taking the time to file and comment on this issue. We realize it still occurs and impacts your organization. We are now working on multiple customer requests and on new features, so we have to postpone our resolution of this issue. We’ve decided to move this issue to our long-term backlog.
      Please continue watching this ticket for future updates and changes in the timeline that impact your work.

      Best regards,
      Szymon Rachański
      Software Engineer

      Show
      Atlassian Update – 23 February 2024 Dear Customers, Thank you for taking the time to file and comment on this issue. We realize it still occurs and impacts your organization. We are now working on multiple customer requests and on new features, so we have to postpone our resolution of this issue. We’ve decided to move this issue to our long-term backlog. Please continue watching this ticket for future updates and changes in the timeline that impact your work. Best regards, Szymon Rachański Software Engineer

    Description

      Problem

      Batched notifications execute periodical tasks (every 14 days) to clean up old events from the AO_733371_EVENT and AO_733371_EVENT_RECIPIENT tables. However, on unknown circumstances, this leads to extremely slow queries on the database if using Microsoft SQL Server. This issue seems to happen more often on older MSSQL versions such as 2012 and 2014.

      The issue seems to be related to JRASERVER-65168 where it was described that single column indexes used by Jira might lead to this problem. Even though that specific bug is resolved, it wasn't created multi-column indexes for the batched notifications tables.

      The same problem can cause delays in batched notifications delivery.

      Example query:

      select "AO_733371_EVENT"."ID", "AO_733371_EVENT"."EVENT_TYPE", "AO_733371_EVENT"."USER_KEY", "AO_733371_EVENT"."ACTION", 
      "AO_733371_EVENT"."ACTION_ID", "AO_733371_EVENT"."EVENT_BUNDLE_ID", "AO_733371_EVENT"."CREATED" 
      from "jiraschema"."AO_733371_EVENT" "AO_733371_EVENT" 
      
      where exists (
      	select "AO_733371_EVENT_RECIPIENT"."ID" 
      	from "jiraschema"."AO_733371_EVENT_RECIPIENT" "AO_733371_EVENT_RECIPIENT" 
      	where "AO_733371_EVENT_RECIPIENT"."EVENT_ID" = "AO_733371_EVENT"."ID" and "AO_733371_EVENT_RECIPIENT"."STATUS" = 'NEW'
      	and "AO_733371_EVENT_RECIPIENT"."CONSUMER_NAME" = 'mailMentionEventConsumer'
      	and "AO_733371_EVENT_RECIPIENT"."SEND_DATE" <= '2020-04-29 06:14:41') 
      and ("AO_733371_EVENT"."EVENT_TYPE" 
      	in (92638,92593,92594,92594,92595,92596,92597,92599,92599,92600)) 
      and "AO_733371_EVENT"."ID" >= 85125
      order by "AO_733371_EVENT"."ID" asc offset @P24 rows fetch next @P25 rows only
      

      Steps to Reproduce

      1. Set up Batched Notifications (enabled by default on Jira 8.0)
      2. Have a large number of email notifications occurring daily
      3. Monitor the database CPU usage with graphs

      Expected Results

      • The cleanup tasks occur without affecting the database performance, as it does on other DBMSs such as Postgres.
      • The Jira Batched Notifications are sent in a timely manner (based on the frequency setting)

      Actual Results

      • The cleanup tasks take several hours to complete, with the CPU spiking to 100% usage during that period. General slowness is experienced on Jira due to this.
      • The Jira Batched Notifications are sent with hours of delay

      Note

      When analyzing thread dumps, we should see 2 Caesium threads running batched notification jobs:

      • Thread 1
        "Caesium-1-2" #276 daemon prio=5 os_prio=0 tid=0x00007feae318d000 nid=0x85f0 runnable [0x00007fe7c2578000]
           java.lang.Thread.State: RUNNABLE
        	at java.net.SocketInputStream.socketRead0(Native Method)
        	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        	at java.net.SocketInputStream.read(SocketInputStream.java:171)
        	at java.net.SocketInputStream.read(SocketInputStream.java:141)
        	at com.microsoft.sqlserver.jdbc.TDSChannel.read(IOBuffer.java:2023)
        	at com.microsoft.sqlserver.jdbc.TDSReader.readPacket(IOBuffer.java:6418)
        	- locked <0x00007ff035c7c3d8> (a com.microsoft.sqlserver.jdbc.TDSReader)
        	at com.microsoft.sqlserver.jdbc.TDSCommand.startResponse(IOBuffer.java:7579)
        	at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:590)
        	at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:522)
        	at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7194)
        	at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:2930)
        	- locked <0x00007feccf08a6e8> (a java.lang.Object)
        	at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:248)
        	at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:223)
        	at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery(SQLServerPreparedStatement.java:444)
        ...
        	at com.atlassian.jira.plugins.inform.batching.cron.BatchNotificationJob.lambda$searchEvents$10(BatchNotificationJob.java:266)
        	at com.atlassian.jira.plugins.inform.batching.cron.BatchNotificationJob$$Lambda$7455/1356348673.apply(Unknown Source)
        	at com.atlassian.jira.plugins.inform.performance.MeasurementWorkerFactory$1.measure(MeasurementWorkerFactory.java:39)
        	at com.atlassian.jira.plugins.inform.batching.cron.BatchNotificationJob.searchEvents(BatchNotificationJob.java:264)
        	at com.atlassian.jira.plugins.inform.batching.cron.BatchNotificationJob.lambda$notifyUsers$1(BatchNotificationJob.java:112)
        	at com.atlassian.jira.plugins.inform.batching.cron.BatchNotificationJob$$Lambda$7454/2336314.apply(Unknown Source)
        	at com.atlassian.jira.plugins.inform.performance.MeasurementWorkerFactory$1.measure(MeasurementWorkerFactory.java:39)
        	at com.atlassian.jira.plugins.inform.batching.cron.BatchNotificationJob.notifyUsers(BatchNotificationJob.java:109)
        	at com.atlassian.jira.plugins.inform.batching.cron.BatchNotificationJob.lambda$runJob$0(BatchNotificationJob.java:86)
        ...
        
        	at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService.executeQueuedJob(CaesiumSchedulerService.java:382)
        	at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService$$Lambda$2167/528009948.accept(Unknown Source)
        	at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.executeJob(SchedulerQueueWorker.java:66)
        	at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.executeNextJob(SchedulerQueueWorker.java:60)
        	at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.run(SchedulerQueueWorker.java:35)
        	at java.lang.Thread.run(Thread.java:748)
        	
      • Thread 2:
        "Caesium-1-1" #275 daemon prio=5 os_prio=0 tid=0x00007feae318c800 nid=0x85ef runnable [0x00007fe7c48a1000]
           java.lang.Thread.State: RUNNABLE
        	at java.net.SocketInputStream.socketRead0(Native Method)
        	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        	at java.net.SocketInputStream.read(SocketInputStream.java:171)
        	at java.net.SocketInputStream.read(SocketInputStream.java:141)
        	at com.microsoft.sqlserver.jdbc.TDSChannel.read(IOBuffer.java:2023)
        	at com.microsoft.sqlserver.jdbc.TDSReader.readPacket(IOBuffer.java:6418)
        	- locked <0x00007ff476800db8> (a com.microsoft.sqlserver.jdbc.TDSReader)
        	at com.microsoft.sqlserver.jdbc.TDSCommand.startResponse(IOBuffer.java:7579)
        	at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:590)
        	at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:522)
        	at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7194)
        	at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:2930)
        	- locked <0x00007feec8026738> (a java.lang.Object)
        	at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:248)
        	at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:223)
        	at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery(SQLServerPreparedStatement.java:444)
        	at org.apache.commons.dbcp2.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:83)
        	at org.apache.commons.dbcp2.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:83)
        	at com.atlassian.jira.ofbiz.sql.PreparedStatementWrapper.executeQuery(PreparedStatementWrapper.java:42)
        	at com.atlassian.jira.diagnostic.connection.DiagnosticPreparedStatement.lambda$executeQuery$5(DiagnosticPreparedStatement.java:59)
        ...
        	at com.atlassian.jira.plugins.inform.batching.cron.BatchNotificationJob.lambda$searchEvents$10(BatchNotificationJob.java:266)
        	at com.atlassian.jira.plugins.inform.batching.cron.BatchNotificationJob$$Lambda$7455/1356348673.apply(Unknown Source)
        	at com.atlassian.jira.plugins.inform.performance.MeasurementWorkerFactory$1.measure(MeasurementWorkerFactory.java:39)
        ...
        	at com.atlassian.jira.plugins.inform.performance.MeasurementWorkerFactory$1.measure(MeasurementWorkerFactory.java:39)
        	at com.atlassian.jira.plugins.inform.batching.cron.BatchNotificationJob.runJob(BatchNotificationJob.java:84)
        	at com.atlassian.jira.plugins.inform.batching.cron.ConditionalJobRunner.runJob(ConditionalJobRunner.java:33)
        	at com.atlassian.jira.plugins.inform.batching.cron.ConditionalJobRunner.runJob(ConditionalJobRunner.java:33
        ...
        	at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.executeNextJob(SchedulerQueueWorker.java:60)
        	at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.run(SchedulerQueueWorker.java:35)
        	at java.lang.Thread.run(Thread.java:748)
        

      When analyzing long running queries on the DB side, we should see long queries that look like this:

      select "AO_733371_EVENT"."ID", "AO_733371_EVENT"."EVENT_TYPE", "AO_733371_EVENT"."USER_KEY", "AO_733371_EVENT"."ACTION", "AO_733371_EVENT"."ACTION_ID", "AO_733371_EVENT"."EVENT_BUNDLE_ID", "AO_733371_EVENT"."CREATED" from "dbo"."AO_733371_EVENT" "AO_733371_EVENT" where exists (select "AO_733371_EVENT_RECIPIENT"."ID" from "dbo"."AO_733371_EVENT_RECIPIENT" "AO_733371_EVENT_RECIPIENT" where "AO_733371_EVENT_RECIPIENT"."EVENT_ID" = "AO_733371_EVENT"."ID" and "AO_733371_EVENT_RECIPIENT"."STATUS" = @P0 and "AO_733371_EVENT_RECIPIENT"."CONSUMER_NAME" = @P1 and "AO_733371_EVENT_RECIPIENT"."SEND_DATE" <= @P2) and ("AO_733371_EVENT"."EVENT_TYPE" in (@P3, @P4, @P5, @P6, @P7, @P8, @P9, @P10, @P11, @P12, @P13, @P14, @P15, @P16, @P17, @P18, @P19, @P20, @P21, @P22, @P23, @P24, @P25, @P26, @P27)) and "AO_733371_EVENT"."ID" >= @P28 order by "AO_733371_EVENT"."ID" asc offset @P29 rows fetch next @P30 rows only
      

      Workaround

      No confirmed workaround available.

      1. Users that have upgrade to newer MSSQL versions (such as 2017) noticed that the occurrence of the problem is smaller.
      2. If you're using newer MSSQL versions and experiencing this problem, verify the compatibility mode set to this database. If it was upgraded from previous MSSQL versions, it might be set to mode 120/110, and there was improved query execution on MSSQL 2017 as can be verified on this Microsoft document

      Attachments

        1. AO_733371 (1).sqlplan
          52 kB
        2. dbQueryTool.jar
          15.53 MB
        3. jira-inform-backdoor-plugin-1.3.6.jar
          1.07 MB
        4. jira-inform-backdoor-plugin-1.3.6-1.jar
          1.07 MB
        5. screenshot-1.png
          screenshot-1.png
          537 kB
        6. Screenshot 2020-09-14 at 14.56.48.jpg
          Screenshot 2020-09-14 at 14.56.48.jpg
          673 kB
        7. Screenshot 2020-09-14 at 14.58.30.jpg
          Screenshot 2020-09-14 at 14.58.30.jpg
          689 kB

        Issue Links

          Activity

            People

              28aa227a1664 Szymon Rachański
              rbaldasso Rodrigo Baldasso
              Votes:
              42 Vote for this issue
              Watchers:
              78 Start watching this issue

              Dates

                Created:
                Updated: