
Frequent deadlocks on the scheduler_run_details table can cause an outage


Details

    Description

      Resolution

      A scheduled job called 'Purge Old Job Run Details' has been introduced to purge the scheduled job history.

      Summary

      Frequent deadlocks on the scheduler_run_details table can cause an outage.

      The scheduler_run_details table holds the job execution history for both clustered and non-clustered jobs. When a deadlock occurs on this table, it prevents the job history from being updated and can eventually lead to an outage.

      Environment

      • Confluence Data Center
      • Confluence Server (standalone)
      • Microsoft SQL Server
      • MySQL database
      • PostgreSQL database
      • Large number of records in the scheduler_run_details table

      Symptoms

      Application logs

      ERROR [Caesium-1-2] [scheduler.caesium.impl.SchedulerQueueWorker] executeJob Unhandled exception thrown by job QueuedJob[jobId=LocalTaskQueueFlushJob,deadline=1502117974606]
      org.springframework.dao.DeadlockLoserDataAccessException: PreparedStatementCallback; SQL [DELETE FROM scheduler_run_details WHERE job_id = ? AND start_time < ?]; Transaction (Process ID 691) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.; nested exception is java.sql.SQLException: Transaction (Process ID 691) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
      
      WARN [Gemini Blueprint context shutdown thread1] [impl.schedule.caesium.SchedulerClusteredJobDaoProxyWrapper] delete Could not delete clustered job 'synchronyStatusCheck'
      org.springframework.transaction.CannotCreateTransactionException: Could not open Hibernate Session for transaction; nested exception is java.lang.NullPointerException
      ...
      Caused by: java.lang.NullPointerException
      

      The job name and ID will vary, as this can happen to any non-clustered job in the Confluence cluster.

      DB logs

      The database error logs will look similar to the following (this example is from Microsoft SQL Server):

      spid29s     deadlock-list
      spid29s      deadlock victim=process366e253848
      ...
      spid29s        process id=process366e253848 taskpriority=0 logused=1156 waitresource=KEY: 8:72057594177191936 (f0b68475683a) waittime=248 ownerId=119979139 transactionname=implicit_transaction lasttranstarted=2017-08-20T19:40:00.013 XDES=0x34cf387240 lockMode=U schedulerid=7 kpid=1004 status=suspended spid=772 sbid=0 ecid=0 priority=0 trancount=2 lastbatchstarted=2017-08-20T19:40:00.007 lastbatchcompleted=2017-08-20T19:40:00.007 lastattention=1900-01-01T00:00:00.007 clientapp=jTDS hostname=R0091KP0C hostpid=123 loginname=jirm70m isolationlevel=read committed (2) xactid=119979139 currentdb=8 lockTimeout=4294967295 clientoption1=671088672 clientoption2=128058
      ...
      spid29s     (@P0 nvarchar(4000),@P1 datetime)DELETE FROM scheduler_run_details WHERE job_id =  @P0  AND start_time <  @P1     
      ...
      

      Confirm whether the delete queries are causing a table scan by generating an execution plan for them; see Displaying Graphical Execution Plans (SQL Server Management Studio) in the Microsoft documentation for details of how to do this in your environment.
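
      To capture a plan directly from a SQL Server query window, a text plan can be produced along the lines of the sketch below. The job ID and date are sample values only; substitute values from your own instance:

      -- Show the estimated plan for the purge statement without executing it (SQL Server).
      -- 'LocalTaskQueueFlushJob' and the date are sample values for illustration.
      SET SHOWPLAN_TEXT ON;
      GO
      DELETE FROM scheduler_run_details
       WHERE job_id = 'LocalTaskQueueFlushJob'
         AND start_time < '2017-08-20 00:00:00';
      GO
      SET SHOWPLAN_TEXT OFF;
      GO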

      Details of the problem

      Multiple nodes may update this table concurrently, because it also holds the history of non-clustered jobs, which run on every node of the cluster.

      The current implementation of SchedulerRunDetailsDao increases contention because it performs a purge (deleting outdated job details) every time a new record is added to the table:

          @Override
          public void addRunDetails(JobId jobId, RunDetails runDetails) {
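              // The two purge calls below delete outdated rows from scheduler_run_details
              // on every insert, which is the source of the contention described above.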
              purgeOldHistory();
              purgeJobHistory(jobId, runDetails.getRunOutcome() == RunOutcome.SUCCESS);
      
              SchedulerRunDetails record = SchedulerRunDetails.fromRunDetails(jobId, runDetails);
              save(record);
          }
      
         private void purgeJobHistory(JobId jobId, boolean success) {
              Date threshold = new Date(System.currentTimeMillis() - PURGE_JOB_OFFSET_MILLIS);
              if (success) {
                  // The current write is a SUCCESS, purge everything older than threshold.
                  JdbcTemplate jdbc = DataAccessUtils.getJdbcTemplate(getSession());
                  jdbc.update(
                          "DELETE FROM scheduler_run_details WHERE job_id = ? AND start_time < ?",
                          new Object[]{jobId.toString(), threshold}
                  );
              } else {
                  JdbcTemplate jdbc = DataAccessUtils.getJdbcTemplate(getSession());
                  String successOutcome = String.valueOf(SchedulerRunDetails.runOutcomeToChar(RunOutcome.SUCCESS));
                  jdbc.update(
                          "DELETE FROM scheduler_run_details WHERE job_id = ? AND start_time < ? AND outcome <> ?",
                          new Object[]{jobId.toString(), threshold, successOutcome}
                  );
              }
          }
      

      Normally, the database should be able to handle this kind of concurrent transaction, since the statements delete rows rather than update the same record. However, MS SQL Server and MySQL do not handle it perfectly: lock contention escalates, one transaction is chosen as the deadlock victim and rolled back, and this can lead to the other symptoms described above.

      These are two queries frequently being deadlocked:

      DELETE FROM scheduler_run_details WHERE job_id =  @P0  AND start_time <  @P1
      DELETE FROM scheduler_run_details WHERE job_id =  @P0  AND start_time <  @P1  AND outcome <>  @P2
      

      The lack of supporting indexes is very likely to cause a full table scan during the execution of the delete statements. Adding the indexes described in the workaround below should significantly reduce the chance of deadlocks.

      In addition, the design of addRunDetails is not optimal. The purge should be implemented as a cluster-wide job rather than as part of the add logic, which would eliminate the contention completely, or it should at least be made less frequent (daily, for example).

      Workaround

      Option 1

      Add the following two composite indexes to the scheduler_run_details table, based on the predicates of the two delete statements:

      CREATE NONCLUSTERED INDEX IX_scheduler_run_details_jobib_starttime ON confluence.scheduler_run_details (job_id, start_time);  
      CREATE NONCLUSTERED INDEX IX_scheduler_run_details_jobib_starttime_outcome ON confluence.scheduler_run_details (job_id, start_time, outcome);  
      
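      The statements above use SQL Server syntax (NONCLUSTERED) and the confluence schema prefix from this environment. On MySQL or PostgreSQL, equivalent composite indexes can be created with a plain CREATE INDEX along the lines of the sketch below; adjust the schema or table prefix to match your own database:

      CREATE INDEX IX_scheduler_run_details_jobib_starttime ON scheduler_run_details (job_id, start_time);
      CREATE INDEX IX_scheduler_run_details_jobib_starttime_outcome ON scheduler_run_details (job_id, start_time, outcome);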

      Option 2

      After performing the above, deadlocks are still possible if the table is very large (e.g. over 100k rows), due to the frequency of the purge requests.

      You can try deleting entries in scheduler_run_details older than 1 day. The SQL query to do this would be something like the following, replacing the placeholder with a timestamp from one day ago:

      DELETE FROM scheduler_run_details WHERE start_time < ?
      
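      For example, on SQL Server the one-day-ago threshold can be computed with DATEADD; this is only a sketch, and MySQL and PostgreSQL have their own equivalent date functions:

      -- Delete job run history older than one day (SQL Server syntax).
      DELETE FROM scheduler_run_details
       WHERE start_time < DATEADD(day, -1, GETDATE());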

      Important: You should not run this in production without first testing in your staging environment. You should also take a database backup or dump prior to running this in production. If anything does not work as expected, you can then restore from the backup.

      Option 3

      You may also try to apply one of the following DB changes (a SQL Server sketch of both follows the list):

      1. Disable lock escalation for the table scheduler_run_details
      2. Disable page lock for the table scheduler_run_details
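
      On SQL Server, both changes can be applied with ALTER statements similar to the sketch below. This is an illustration only; other databases expose different (or no) equivalents, so check with your DBA first:

      -- 1. Stop lock escalation from row/page locks to a table lock on this table.
      ALTER TABLE scheduler_run_details SET (LOCK_ESCALATION = DISABLE);

      -- 2. Disallow page locks on the table's indexes so that only row locks are taken.
      ALTER INDEX ALL ON scheduler_run_details SET (ALLOW_PAGE_LOCKS = OFF);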

      The latter change would prevent the possibility of a deadlock caused by concurrent purges. Since the purged jobs are obsolete, this shouldn't be a problem.

      Your DBA may have some experience in these options too.
