Uploaded image for project: 'Jira Server and Data Center'
  1. Jira Server and Data Center
  2. JRASERVER-71350

Batching notifications tables can lead to high CPU usage on MSSQL

    XMLWordPrintable

Details

    Description

      Problem

      Batched notifications execute periodical tasks (every 14 days) to clean up old events from the AO_733371_EVENT and AO_733371_EVENT_RECIPIENT tables. However, on unknown circumstances, this leads to extremely slow queries on the database if using Microsoft SQL Server. This issue seems to happen more often on older MSSQL versions such as 2012 and 2014.

      The issue seems to be related to JRASERVER-65168 where it was described that single column indexes used by Jira might lead to this problem. Even though that specific bug is resolved, it wasn't created multi-column indexes for the batched notifications tables.

      The same problem can cause delays in batched notifications delivery.

      Example query:

      select "AO_733371_EVENT"."ID", "AO_733371_EVENT"."EVENT_TYPE", "AO_733371_EVENT"."USER_KEY", "AO_733371_EVENT"."ACTION", 
      "AO_733371_EVENT"."ACTION_ID", "AO_733371_EVENT"."EVENT_BUNDLE_ID", "AO_733371_EVENT"."CREATED" 
      from "jiraschema"."AO_733371_EVENT" "AO_733371_EVENT" 
      
      where exists (
      	select "AO_733371_EVENT_RECIPIENT"."ID" 
      	from "jiraschema"."AO_733371_EVENT_RECIPIENT" "AO_733371_EVENT_RECIPIENT" 
      	where "AO_733371_EVENT_RECIPIENT"."EVENT_ID" = "AO_733371_EVENT"."ID" and "AO_733371_EVENT_RECIPIENT"."STATUS" = 'NEW'
      	and "AO_733371_EVENT_RECIPIENT"."CONSUMER_NAME" = 'mailMentionEventConsumer'
      	and "AO_733371_EVENT_RECIPIENT"."SEND_DATE" <= '2020-04-29 06:14:41') 
      and ("AO_733371_EVENT"."EVENT_TYPE" 
      	in (92638,92593,92594,92594,92595,92596,92597,92599,92599,92600)) 
      and "AO_733371_EVENT"."ID" >= 85125
      order by "AO_733371_EVENT"."ID" asc offset @P24 rows fetch next @P25 rows only
      

      Steps to Reproduce

      1. Set up Batched Notifications (enabled by default on Jira 8.0)
      2. Have a large number of email notifications occurring daily
      3. Monitor the database CPU usage with graphs

      Expected Results

      The cleanup tasks occur without affecting the database performance, as it does on other DBMSs such as Postgres.

      Actual Results

      The cleanup tasks take several hours to complete, with the CPU spiking to 100% usage during that period. General slowness is experienced on Jira due to this.

      Note

      When analyzing thread dumps, we should see 2 Caesium threads running batched notification jobs:

      • Thread 1
        "Caesium-1-2" #276 daemon prio=5 os_prio=0 tid=0x00007feae318d000 nid=0x85f0 runnable [0x00007fe7c2578000]
           java.lang.Thread.State: RUNNABLE
        	at java.net.SocketInputStream.socketRead0(Native Method)
        	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        	at java.net.SocketInputStream.read(SocketInputStream.java:171)
        	at java.net.SocketInputStream.read(SocketInputStream.java:141)
        	at com.microsoft.sqlserver.jdbc.TDSChannel.read(IOBuffer.java:2023)
        	at com.microsoft.sqlserver.jdbc.TDSReader.readPacket(IOBuffer.java:6418)
        	- locked <0x00007ff035c7c3d8> (a com.microsoft.sqlserver.jdbc.TDSReader)
        	at com.microsoft.sqlserver.jdbc.TDSCommand.startResponse(IOBuffer.java:7579)
        	at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:590)
        	at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:522)
        	at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7194)
        	at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:2930)
        	- locked <0x00007feccf08a6e8> (a java.lang.Object)
        	at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:248)
        	at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:223)
        	at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery(SQLServerPreparedStatement.java:444)
        ...
        	at com.atlassian.jira.plugins.inform.batching.cron.BatchNotificationJob.lambda$searchEvents$10(BatchNotificationJob.java:266)
        	at com.atlassian.jira.plugins.inform.batching.cron.BatchNotificationJob$$Lambda$7455/1356348673.apply(Unknown Source)
        	at com.atlassian.jira.plugins.inform.performance.MeasurementWorkerFactory$1.measure(MeasurementWorkerFactory.java:39)
        	at com.atlassian.jira.plugins.inform.batching.cron.BatchNotificationJob.searchEvents(BatchNotificationJob.java:264)
        	at com.atlassian.jira.plugins.inform.batching.cron.BatchNotificationJob.lambda$notifyUsers$1(BatchNotificationJob.java:112)
        	at com.atlassian.jira.plugins.inform.batching.cron.BatchNotificationJob$$Lambda$7454/2336314.apply(Unknown Source)
        	at com.atlassian.jira.plugins.inform.performance.MeasurementWorkerFactory$1.measure(MeasurementWorkerFactory.java:39)
        	at com.atlassian.jira.plugins.inform.batching.cron.BatchNotificationJob.notifyUsers(BatchNotificationJob.java:109)
        	at com.atlassian.jira.plugins.inform.batching.cron.BatchNotificationJob.lambda$runJob$0(BatchNotificationJob.java:86)
        ...
        
        	at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService.executeQueuedJob(CaesiumSchedulerService.java:382)
        	at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService$$Lambda$2167/528009948.accept(Unknown Source)
        	at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.executeJob(SchedulerQueueWorker.java:66)
        	at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.executeNextJob(SchedulerQueueWorker.java:60)
        	at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.run(SchedulerQueueWorker.java:35)
        	at java.lang.Thread.run(Thread.java:748)
        	
      • Thread 2:
        "Caesium-1-1" #275 daemon prio=5 os_prio=0 tid=0x00007feae318c800 nid=0x85ef runnable [0x00007fe7c48a1000]
           java.lang.Thread.State: RUNNABLE
        	at java.net.SocketInputStream.socketRead0(Native Method)
        	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        	at java.net.SocketInputStream.read(SocketInputStream.java:171)
        	at java.net.SocketInputStream.read(SocketInputStream.java:141)
        	at com.microsoft.sqlserver.jdbc.TDSChannel.read(IOBuffer.java:2023)
        	at com.microsoft.sqlserver.jdbc.TDSReader.readPacket(IOBuffer.java:6418)
        	- locked <0x00007ff476800db8> (a com.microsoft.sqlserver.jdbc.TDSReader)
        	at com.microsoft.sqlserver.jdbc.TDSCommand.startResponse(IOBuffer.java:7579)
        	at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:590)
        	at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:522)
        	at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7194)
        	at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:2930)
        	- locked <0x00007feec8026738> (a java.lang.Object)
        	at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:248)
        	at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:223)
        	at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery(SQLServerPreparedStatement.java:444)
        	at org.apache.commons.dbcp2.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:83)
        	at org.apache.commons.dbcp2.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:83)
        	at com.atlassian.jira.ofbiz.sql.PreparedStatementWrapper.executeQuery(PreparedStatementWrapper.java:42)
        	at com.atlassian.jira.diagnostic.connection.DiagnosticPreparedStatement.lambda$executeQuery$5(DiagnosticPreparedStatement.java:59)
        ...
        	at com.atlassian.jira.plugins.inform.batching.cron.BatchNotificationJob.lambda$searchEvents$10(BatchNotificationJob.java:266)
        	at com.atlassian.jira.plugins.inform.batching.cron.BatchNotificationJob$$Lambda$7455/1356348673.apply(Unknown Source)
        	at com.atlassian.jira.plugins.inform.performance.MeasurementWorkerFactory$1.measure(MeasurementWorkerFactory.java:39)
        ...
        	at com.atlassian.jira.plugins.inform.performance.MeasurementWorkerFactory$1.measure(MeasurementWorkerFactory.java:39)
        	at com.atlassian.jira.plugins.inform.batching.cron.BatchNotificationJob.runJob(BatchNotificationJob.java:84)
        	at com.atlassian.jira.plugins.inform.batching.cron.ConditionalJobRunner.runJob(ConditionalJobRunner.java:33)
        	at com.atlassian.jira.plugins.inform.batching.cron.ConditionalJobRunner.runJob(ConditionalJobRunner.java:33
        ...
        	at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.executeNextJob(SchedulerQueueWorker.java:60)
        	at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.run(SchedulerQueueWorker.java:35)
        	at java.lang.Thread.run(Thread.java:748)
        

      When analyzing long running queries on the DB side, we should see long queries that look like this:

      select "AO_733371_EVENT"."ID", "AO_733371_EVENT"."EVENT_TYPE", "AO_733371_EVENT"."USER_KEY", "AO_733371_EVENT"."ACTION", "AO_733371_EVENT"."ACTION_ID", "AO_733371_EVENT"."EVENT_BUNDLE_ID", "AO_733371_EVENT"."CREATED" from "dbo"."AO_733371_EVENT" "AO_733371_EVENT" where exists (select "AO_733371_EVENT_RECIPIENT"."ID" from "dbo"."AO_733371_EVENT_RECIPIENT" "AO_733371_EVENT_RECIPIENT" where "AO_733371_EVENT_RECIPIENT"."EVENT_ID" = "AO_733371_EVENT"."ID" and "AO_733371_EVENT_RECIPIENT"."STATUS" = @P0 and "AO_733371_EVENT_RECIPIENT"."CONSUMER_NAME" = @P1 and "AO_733371_EVENT_RECIPIENT"."SEND_DATE" <= @P2) and ("AO_733371_EVENT"."EVENT_TYPE" in (@P3, @P4, @P5, @P6, @P7, @P8, @P9, @P10, @P11, @P12, @P13, @P14, @P15, @P16, @P17, @P18, @P19, @P20, @P21, @P22, @P23, @P24, @P25, @P26, @P27)) and "AO_733371_EVENT"."ID" >= @P28 order by "AO_733371_EVENT"."ID" asc offset @P29 rows fetch next @P30 rows only
      

      Workaround

      No confirmed workaround available.

      1. Users that have upgrade to newer MSSQL versions (such as 2017) noticed that the occurrence of the problem is smaller.
      2. If you're using newer MSSQL versions and experiencing this problem, verify the compatibility mode set to this database. If it was upgraded from previous MSSQL versions, it might be set to mode 120/110, and there was improved query execution on MSSQL 2017 as can be verified on this Microsoft document

      Attachments

        1. AO_733371 (1).sqlplan
          52 kB
        2. dbQueryTool.jar
          15.53 MB
        3. jira-inform-backdoor-plugin-1.3.6.jar
          1.07 MB
        4. jira-inform-backdoor-plugin-1.3.6-1.jar
          1.07 MB
        5. screenshot-1.png
          screenshot-1.png
          537 kB
        6. Screenshot 2020-09-14 at 14.56.48.jpg
          Screenshot 2020-09-14 at 14.56.48.jpg
          673 kB
        7. Screenshot 2020-09-14 at 14.58.30.jpg
          Screenshot 2020-09-14 at 14.58.30.jpg
          689 kB

        Issue Links

          Activity

            People

              Unassigned Unassigned
              rbaldasso Rodrigo Baldasso
              Votes:
              23 Vote for this issue
              Watchers:
              51 Start watching this issue

              Dates

                Created:
                Updated: