
Frequent deadlocks on the scheduler_run_details table can cause an outage


Details

    Description

      Resolution

      A scheduled job called 'Purge Old Job Run Details' has been introduced to purge the scheduled job history.

      Summary

      Frequent deadlocks on the scheduler_run_details table can cause an outage.

      The scheduler_run_details table holds the job execution history for both clustered and non-clustered jobs. When a deadlock occurs on this table, it prevents the job history from being updated and can eventually lead to an outage.

      Environment

      • Confluence Data Center
      • Confluence Server (standalone)
      • Microsoft SQL Server
      • MySQL database
      • PostgreSQL database
      • Large number of records in the scheduler_run_details table

      Symptoms

      Application logs

      ERROR [Caesium-1-2] [scheduler.caesium.impl.SchedulerQueueWorker] executeJob Unhandled exception thrown by job QueuedJob[jobId=LocalTaskQueueFlushJob,deadline=1502117974606]
      org.springframework.dao.DeadlockLoserDataAccessException: PreparedStatementCallback; SQL [DELETE FROM scheduler_run_details WHERE job_id = ? AND start_time < ?]; Transaction (Process ID 691) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.; nested exception is java.sql.SQLException: Transaction (Process ID 691) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
      
      WARN [Gemini Blueprint context shutdown thread1] [impl.schedule.caesium.SchedulerClusteredJobDaoProxyWrapper] delete Could not delete clustered job 'synchronyStatusCheck'
      org.springframework.transaction.CannotCreateTransactionException: Could not open Hibernate Session for transaction; nested exception is java.lang.NullPointerException
      ...
      Caused by: java.lang.NullPointerException
      

      The job name and ID will vary, as this can happen to any non-clustered job in the Confluence cluster.

      DB logs

      The database error logs will look similar to the following (this example is from Microsoft SQL Server):

      spid29s     deadlock-list
      spid29s      deadlock victim=process366e253848
      ...
      spid29s        process id=process366e253848 taskpriority=0 logused=1156 waitresource=KEY: 8:72057594177191936 (f0b68475683a) waittime=248 ownerId=119979139 transactionname=implicit_transaction lasttranstarted=2017-08-20T19:40:00.013 XDES=0x34cf387240 lockMode=U schedulerid=7 kpid=1004 status=suspended spid=772 sbid=0 ecid=0 priority=0 trancount=2 lastbatchstarted=2017-08-20T19:40:00.007 lastbatchcompleted=2017-08-20T19:40:00.007 lastattention=1900-01-01T00:00:00.007 clientapp=jTDS hostname=R0091KP0C hostpid=123 loginname=jirm70m isolationlevel=read committed (2) xactid=119979139 currentdb=8 lockTimeout=4294967295 clientoption1=671088672 clientoption2=128058
      ...
      spid29s     (@P0 nvarchar(4000),@P1 datetime)DELETE FROM scheduler_run_details WHERE job_id =  @P0  AND start_time <  @P1     
      ...
      

      Confirm whether the delete queries are causing a table scan by generating an execution plan for them; see Displaying Graphical Execution Plans (SQL Server Management Studio) in the Microsoft documentation for details of how to do this in your environment.
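
      To capture a plan directly from a SQL Server query window, a text plan can be produced along the lines of the sketch below. The job ID and date are sample values only; substitute values from your own instance:

      -- Show the estimated plan for the purge statement without executing it (SQL Server).
      -- 'LocalTaskQueueFlushJob' and the date are sample values for illustration.
      SET SHOWPLAN_TEXT ON;
      GO
      DELETE FROM scheduler_run_details
       WHERE job_id = 'LocalTaskQueueFlushJob'
         AND start_time < '2017-08-20 00:00:00';
      GO
      SET SHOWPLAN_TEXT OFF;
      GO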

      Details of the problem

      Multiple nodes may update this table concurrently, because it also holds the history of non-clustered jobs, which run on every node of the cluster.

      The current implementation of SchedulerRunDetailsDao increases contention because it performs a purge (deleting outdated job details) every time a new record is added to the table:

          @Override
          public void addRunDetails(JobId jobId, RunDetails runDetails) {
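              // The two purge calls below delete outdated rows from scheduler_run_details
              // on every insert, which is the source of the contention described above.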
              purgeOldHistory();
              purgeJobHistory(jobId, runDetails.getRunOutcome() == RunOutcome.SUCCESS);
      
              SchedulerRunDetails record = SchedulerRunDetails.fromRunDetails(jobId, runDetails);
              save(record);
          }
      
         private void purgeJobHistory(JobId jobId, boolean success) {
              Date threshold = new Date(System.currentTimeMillis() - PURGE_JOB_OFFSET_MILLIS);
              if (success) {
                  // The current write is a SUCCESS, purge everything older than threshold.
                  JdbcTemplate jdbc = DataAccessUtils.getJdbcTemplate(getSession());
                  jdbc.update(
                          "DELETE FROM scheduler_run_details WHERE job_id = ? AND start_time < ?",
                          new Object[]{jobId.toString(), threshold}
                  );
              } else {
                  JdbcTemplate jdbc = DataAccessUtils.getJdbcTemplate(getSession());
                  String successOutcome = String.valueOf(SchedulerRunDetails.runOutcomeToChar(RunOutcome.SUCCESS));
                  jdbc.update(
                          "DELETE FROM scheduler_run_details WHERE job_id = ? AND start_time < ? AND outcome <> ?",
                          new Object[]{jobId.toString(), threshold, successOutcome}
                  );
              }
          }
      

      Normally, the database should be able to handle this kind of concurrent transaction, since the statements delete rows rather than update the same record. However, MS SQL Server and MySQL do not handle it perfectly: lock contention escalates, one transaction is chosen as the deadlock victim and rolled back, and this can lead to the other symptoms described above.

      These are two queries frequently being deadlocked:

      DELETE FROM scheduler_run_details WHERE job_id =  @P0  AND start_time <  @P1
      DELETE FROM scheduler_run_details WHERE job_id =  @P0  AND start_time <  @P1  AND outcome <>  @P2
      

      The lack of supporting indexes is very likely to cause a full table scan during the execution of the delete statements. Adding the indexes described in the workaround below should significantly reduce the chance of deadlocks.

      In addition, the design of addRunDetails is not optimal. The purge should be implemented as a cluster-wide job rather than as part of the add logic, which would eliminate the contention completely, or it should at least be made less frequent (daily, for example).

      Workaround

      Option 1

      Add the following two composite indexes to the scheduler_run_details table, based on the predicates of the two delete statements:

      CREATE NONCLUSTERED INDEX IX_scheduler_run_details_jobib_starttime ON confluence.scheduler_run_details (job_id, start_time);  
      CREATE NONCLUSTERED INDEX IX_scheduler_run_details_jobib_starttime_outcome ON confluence.scheduler_run_details (job_id, start_time, outcome);  
      
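      The statements above use SQL Server syntax (NONCLUSTERED) and the confluence schema prefix from this environment. On MySQL or PostgreSQL, equivalent composite indexes can be created with a plain CREATE INDEX along the lines of the sketch below; adjust the schema or table prefix to match your own database:

      CREATE INDEX IX_scheduler_run_details_jobib_starttime ON scheduler_run_details (job_id, start_time);
      CREATE INDEX IX_scheduler_run_details_jobib_starttime_outcome ON scheduler_run_details (job_id, start_time, outcome);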

      Option 2

      After performing the above, deadlocks are still possible if the table is very large (e.g. over 100k rows), due to the frequency of the purge requests.

      You can try deleting entries in scheduler_run_details older than 1 day. The SQL query to do this would be something like the following, replacing the placeholder with a timestamp from one day ago:

      DELETE FROM scheduler_run_details WHERE start_time < ?
      
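      For example, on SQL Server the one-day-ago threshold can be computed with DATEADD; this is only a sketch, and MySQL and PostgreSQL have their own equivalent date functions:

      -- Delete job run history older than one day (SQL Server syntax).
      DELETE FROM scheduler_run_details
       WHERE start_time < DATEADD(day, -1, GETDATE());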

      Important: You should not run this in production without first testing in your staging environment. You should also take a database backup or dump prior to running this in production. If anything does not work as expected, you can then restore from the backup.

      Option 3

      You may also try to apply one of the following DB changes (a SQL Server sketch of both follows the list):

      1. Disable lock escalation for the table scheduler_run_details
      2. Disable page lock for the table scheduler_run_details
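
      On SQL Server, both changes can be applied with ALTER statements similar to the sketch below. This is an illustration only; other databases expose different (or no) equivalents, so check with your DBA first:

      -- 1. Stop lock escalation from row/page locks to a table lock on this table.
      ALTER TABLE scheduler_run_details SET (LOCK_ESCALATION = DISABLE);

      -- 2. Disallow page locks on the table's indexes so that only row locks are taken.
      ALTER INDEX ALL ON scheduler_run_details SET (ALLOW_PAGE_LOCKS = OFF);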

      The latter change would prevent the possibility of a deadlock caused by concurrent purges. Since the purged jobs are obsolete, this shouldn't be a problem.

      Your DBA may have some experience in these options too.
