  1. Jira Server and Data Center
  2. JRASERVER-62072

Database connectivity issue causes scheduled jobs to break


Details

    Description

      Summary

      Scheduled jobs stop being triggered by the Jira Scheduler if a database operation fails at the precise moment the Scheduler tries to trigger them.

      Impact of the bug

      Examples of impacted functionalities

      The following functionalities can be impacted, since they rely on a scheduled job to run:

      • Mail Queue (mails might keep piling up in the Mail Queue, since the Mail Queue service might stop being scheduled)
      • Jira Incoming Mail Handler
      • Jira Service Management (JSM) Mail Handler
      • Jira Batched Notifications
      • Jira Service Management (JSM) Notifications
      • User directory (LDAP) sync
      • Automation rules from "Automation for Jira"

      In short, if a DB operation fails (or there is a temporary DB connectivity issue) while the Jira Scheduler tries to run any of these jobs, the Scheduler will ignore them from then on and stop running them until the Jira application (or the impacted Jira node) is restarted.

      Note that the list above is not exhaustive: any other functionality that relies on a scheduled job might be impacted if the database connection/operation error occurs right at the time the job is due to be triggered.
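
The failure mode described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Caesium's actual code: the class names `FlakyDatabase` and `NaiveScheduler`, the 60-second interval and the job ID `mail-queue` are all hypothetical. It assumes, as the first stack trace in this report suggests, that a database lookup happens after the job has been taken off the queue, so an exception at that point leaves nothing to put the job back.

```python
import heapq


class FlakyDatabase:
    """Hypothetical stand-in for the database; fails while `down` is True."""

    def __init__(self):
        self.down = False

    def read_time_zone(self):
        if self.down:
            raise ConnectionError("database unavailable")
        return "UTC"


class NaiveScheduler:
    """Toy scheduler: a job is re-queued only if the database lookup succeeds."""

    def __init__(self, db, interval):
        self.db = db
        self.interval = interval
        self.queue = []  # heap of (run_at, job_id)

    def schedule(self, job_id, run_at):
        heapq.heappush(self.queue, (run_at, job_id))

    def tick(self, now):
        """Process every job due at `now`; return the IDs that ran."""
        ran = []
        while self.queue and self.queue[0][0] <= now:
            run_at, job_id = heapq.heappop(self.queue)  # job leaves the queue here
            try:
                self.db.read_time_zone()  # lookup needed to compute the next run
                ran.append(job_id)
                self.schedule(job_id, run_at + self.interval)
            except ConnectionError:
                # The error is reported, but nothing re-queues the job.
                print(f"Unhandled exception thrown by job {job_id}")
        return ran


db = FlakyDatabase()
scheduler = NaiveScheduler(db, interval=60)
scheduler.schedule("mail-queue", run_at=0)

history = [scheduler.tick(0)]        # database up: the job runs and is re-queued
db.down = True
history.append(scheduler.tick(60))   # database down: the job is dropped
db.down = False
history.append(scheduler.tick(120))  # database back: the queue is already empty
print(history)
```

In this model the job never runs again after the outage, even though the database has recovered, which matches the behaviour reported here: only recreating the scheduler state (a restart) brings the job back.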

      Difference of the impact between Jira Server and Data Center

      The impact of the bug differs depending on whether you are using Jira Server (or Data Center single node) or Data Center multi node:

      • For Jira Server / Jira Data Center (JDC) single node
        • any scheduled jobs can be impacted
      • For Jira Data Center (JDC) multi node
        • only the scheduled jobs executed locally on each node, such as the Mail Queue Service, will be impacted

      The reason for this difference between single node and multi node is that, on JDC multi-node environments:

      • most jobs are executed using the cluster lock system (so-called "beehive"). These jobs can only be run by one node at a time, and there is logic that automatically unlocks these jobs if they get stuck due to a database operation failure. This logic was implemented in Jira 8.3.0 as per the bug "JIRA DC might lose Cluster lock due database connectivity problems"
      • some jobs (such as the mail queue service) are not using the cluster lock system, and are executed "locally". This means that each Jira node has an instance of this job and this job can be executed simultaneously by any node

      For more information about the difference between the jobs using the cluster lock (beehive) system and the jobs run locally, you can refer to the developer page Developing for high availability and clustering.

      Consequence on the Mail Queue

      Since every type of notification (Jira batched/non-batched notifications, JSM customer notifications) relies on the Mail Queue Service to be sent from the Mail Queue, the following will happen if the Mail Queue job is impacted by this bug:

      • For Jira Server / Jira Data Center (JDC) single node
        • The Mail Queue will keep piling up until it is manually flushed by a Jira admin
        • Notifications (of any type) will completely stop being sent
      • For Jira Data Center (JDC) multi node
        • The Mail Queue will keep piling up only on the impacted Jira nodes (since each Jira node manages its own Mail Queue Service)
        • Notifications (of any type) will intermittently fail to be sent, depending on which Jira node the notification was triggered from:
          • If the notification was triggered from a node with a functioning Mail Queue service, it will be sent as expected
          • If the notification was triggered from a node with a non-functioning Mail Queue service (due to this bug), it will not be sent and will remain stuck in the Mail Queue until it is manually flushed by a Jira admin

      Environment

      • Jira Server / Jira Data Center single node (any scheduled jobs can be impacted)
      • Jira Data Center multi node (only jobs executed locally, such as the Mail Queue Service, can be impacted)

      Steps to Reproduce

      1. Schedule a Job
      2. Introduce a breakpoint at CaesiumSchedulerService.executeClusteredJob
      3. Interrupt database connectivity

      Actual Results

      Jira will lose track of the job and will never execute it again, unless a restart is performed.

      In addition, an error will be recorded in the Jira logs at the time the Jira scheduler tried to trigger the scheduled job but failed due to a DB connection/operation failure. It will be similar to one of the errors listed below:

      • Example 1 (failure to schedule the mail queue service, whose job ID is com.atlassian.jira.service.JiraService:10000):
        2022-03-05 11:42:00,729 Caesium-1-3 ERROR ServiceRunner     [c.a.s.caesium.impl.SchedulerQueueWorker] Unhandled exception thrown by job QueuedJob[jobId=com.atlassian.jira.service.JiraService:10000,deadline=1646509320000]
        com.opensymphony.module.propertyset.PropertyImplementationException: Unable to load values for CacheKey[entityName=jira.properties,entityId=1]
        	at com.atlassian.jira.propertyset.CachingOfBizPropertyEntryStore.propEx(CachingOfBizPropertyEntryStore.java:374)
        	at com.atlassian.jira.propertyset.CachingOfBizPropertyEntryStore.resolve(CachingOfBizPropertyEntryStore.java:128)
        	at com.atlassian.jira.propertyset.CachingOfBizPropertyEntryStore.getEntry(CachingOfBizPropertyEntryStore.java:151)
        	at com.atlassian.jira.propertyset.CachingOfBizPropertySet.get(CachingOfBizPropertySet.java:189)
        	at com.opensymphony.module.propertyset.AbstractPropertySet.getString(AbstractPropertySet.java:305)
        	at com.atlassian.jira.config.properties.ApplicationPropertiesStore.getStringFromDb(ApplicationPropertiesStore.java:234)
        	at com.atlassian.jira.config.properties.ApplicationPropertiesImpl.getString(ApplicationPropertiesImpl.java:53)
        	at com.atlassian.jira.scheduler.JiraCaesiumSchedulerConfiguration.getDefaultTimeZone(JiraCaesiumSchedulerConfiguration.java:30)
        	at com.atlassian.scheduler.caesium.impl.RunTimeCalculator.getTimeZone(RunTimeCalculator.java:115)
        	at com.atlassian.scheduler.caesium.impl.RunTimeCalculator.nextRunTime(RunTimeCalculator.java:96)
        	at com.atlassian.scheduler.caesium.impl.RunTimeCalculator.nextRunTime(RunTimeCalculator.java:70)
        	at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService.calculateNextRunTime(CaesiumSchedulerService.java:444)
        	at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService.executeLocalJob(CaesiumSchedulerService.java:401)
        	at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService.executeQueuedJob(CaesiumSchedulerService.java:380)
        	at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.executeJob(SchedulerQueueWorker.java:66)
        	at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.executeNextJob(SchedulerQueueWorker.java:60)
        	at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.run(SchedulerQueueWorker.java:35)
        	at java.lang.Thread.run(Thread.java:748)
        Caused by: com.atlassian.cache.CacheException: com.atlassian.jira.exception.DataAccessException: com.microsoft.sqlserver.jdbc.SQLServerException: Cannot open database "MSSQLJiraDBprod" requested by the login. The login failed. ClientConnectionId:29ea07a4-6e17-4ffa-9602-940d44963f71
        	at com.atlassian.cache.ehcache.DelegatingCache.get(DelegatingCache.java:113)
        	at com.atlassian.jira.propertyset.CachingOfBizPropertyEntryStore.resolve(CachingOfBizPropertyEntryStore.java:126)
        	... 16 more
        Caused by: com.atlassian.jira.exception.DataAccessException: com.microsoft.sqlserver.jdbc.SQLServerException: Cannot open database "MSSQLJiraDBprod" requested by the login. The login failed. ClientConnectionId:29ea07a4-6e17-4ffa-9602-940d44963f71
        	at com.atlassian.jira.database.DatabaseAccessorImpl.borrowConnection(DatabaseAccessorImpl.java:167)
        	at com.atlassian.jira.database.DefaultQueryDslAccessor$1.executeQuery(DefaultQueryDslAccessor.java:84)
        	at com.atlassian.jira.propertyset.CachingOfBizPropertyEntryStore.query(CachingOfBizPropertyEntryStore.java:326)
        ...
        
        	... 17 more
        Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: Cannot open database "MSSQLJiraDBprod" requested by the login. The login failed. ClientConnectionId:29ea07a4-6e17-4ffa-9602-940d44963f71
        	at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:262)
        	at com.microsoft.sqlserver.jdbc.TDSTokenHandler.onEOF(tdsparser.java:283)
        	at com.microsoft.sqlserver.jdbc.TDSParser.parse(tdsparser.java:129)
        	at com.microsoft.sqlserver.jdbc.TDSParser.parse(tdsparser.java:37)
        	at com.microsoft.sqlserver.jdbc.SQLServerConnection.sendLogon(SQLServerConnection.java:5333)
        ...
        
      • Example 2 (failure to execute the Batched Notification job when using a Postgres DB):
        2020-12-05 15:07:52,615-0600 Caesium-1-1 ERROR ServiceRunner     [c.a.s.caesium.impl.SchedulerQueueWorker] Unhandled exception thrown by job QueuedJob[jobId=com.atlassian.jira.plugins.inform.batching.cron.BatchNotificationJobSchedulerImpl.mentions,deadline=1607202432603]
        java.lang.reflect.InvocationTargetException
        	at sun.reflect.GeneratedMethodAccessor404.invoke(Unknown Source)
        	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        	at java.lang.reflect.Method.invoke(Method.java:498)
        	at 
        ...
        com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService.executeClusteredJob(CaesiumSchedulerService.java:409)
        	at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService.executeClusteredJobWithRecoveryGuard(CaesiumSchedulerService.java:454)
        	at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService.executeQueuedJob(CaesiumSchedulerService.java:382)
        	at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.executeJob(SchedulerQueueWorker.java:66)
        	at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.executeNextJob(SchedulerQueueWorker.java:60)
        	at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.run(SchedulerQueueWorker.java:35)
        	at java.lang.Thread.run(Thread.java:748)
        Caused by: com.atlassian.jira.exception.DataAccessException: org.ofbiz.core.entity.GenericDataSourceException: Unable to establish a connection with the database. (The connection attempt failed.)
        	at com.atlassian.jira.ofbiz.DefaultOfBizDelegator.findListIteratorByCondition(DefaultOfBizDelegator.java:408)
        	at com.quisapps.jira.fieldsecurity.ofbiz.SecureOfBizDelegator.findListIteratorByCondition(SecureOfBizDelegator.java:309)
        	... 17 more
        Caused by: org.ofbiz.core.entity.GenericDataSourceException: Unable to establish a connection with the database. (The connection attempt failed.)
        	at org.ofbiz.core.entity.jdbc.SQLProcessor.getConnection(SQLProcessor.java:343)
        	at org.ofbiz.core.entity.GenericDAO.createEntityListIterator(GenericDAO.java:870)
        	at org.ofbiz.core.entity.GenericDAO.selectListIteratorByCondition(GenericDAO.java:857)
        	at org.ofbiz.core.entity.GenericHelperDAO.findListIteratorByCondition(GenericHelperDAO.java:216)
        	at org.ofbiz.core.entity.GenericDelegator.findListIteratorByCondition(GenericDelegator.java:1243)
        	at com.atlassian.jira.ofbiz.DefaultOfBizDelegator.findListIteratorByCondition(DefaultOfBizDelegator.java:405)
        	... 18 more
        Caused by: org.postgresql.util.PSQLException: The connection attempt failed.
        	at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:292)
        	at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49)
        	at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:211)
        	at org.postgresql.Driver.makeConnection(Driver.java:458)
        ...
        com.atlassian.jira.ofbiz.sql.JiraSupportedDatabasesCompatibleJNDIFactory.getConnection(JiraSupportedDatabasesCompatibleJNDIFactory.java:38)
        	at org.ofbiz.core.entity.TransactionFactory.getConnection(TransactionFactory.java:114)
        	at org.ofbiz.core.entity.ConnectionFactory.getConnection(ConnectionFactory.java:59)
        	at org.ofbiz.core.entity.jdbc.SQLProcessor.getConnection(SQLProcessor.java:340)
        	... 24 more
        Caused by: java.net.SocketTimeoutException: connect timed out
        	at java.net.PlainSocketImpl.socketConnect(Native Method)
        	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        ...
        
      • Example 3 (failure to execute the Batched Notification job when using a MS SQL Server DB):
        2020-06-28 07:37:40,642+0200 Caesium-1-4 ERROR ServiceRunner     [c.a.s.caesium.impl.CaesiumSchedulerService] Unhandled exception during the attempt to execute job 'com.atlassian.jira.plugins.inform.batching.cron.BatchNotificationJobSchedulerImpl.mentions'; will attempt recovery in 60 seconds
        com.atlassian.jira.exception.DataAccessException: org.ofbiz.core.entity.GenericDataSourceException: SQL Exception while executing the following:SELECT ID, JOB_ID, JOB_RUNNER_KEY, SCHED_TYPE, INTERVAL_MILLIS, FIRST_RUN, CRON_EXPRESSION, TIME_ZONE, NEXT_RUN, VERSION, PARAMETERS FROM dbo.clusteredjob WHERE JOB_ID=? (Connection reset)
        	at com.atlassian.jira.ofbiz.DefaultOfBizDelegator.findListIteratorByCondition(DefaultOfBizDelegator.java:408)
        	at com.atlassian.jira.ofbiz.WrappingOfBizDelegator.findListIteratorByCondition(WrappingOfBizDelegator.java:283)
        	at com.atlassian.jira.entity.SelectQueryImpl$ExecutionContextImpl.forEach(SelectQueryImpl.java:227)
        	at com.atlassian.jira.entity.SelectQueryImpl$ExecutionContextImpl.consumeWith(SelectQueryImpl.java:214)
        	at com.atlassian.jira.entity.SelectQueryImpl$ExecutionContextImpl.singleValue(SelectQueryImpl.java:191)
        	at com.atlassian.jira.scheduler.OfBizClusteredJobDao.find(OfBizClusteredJobDao.java:88)
        	at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService.executeClusteredJob(CaesiumSchedulerService.java:409)
        	at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService.executeClusteredJobWithRecoveryGuard(CaesiumSchedulerService.java:454)
        	at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService.executeQueuedJob(CaesiumSchedulerService.java:382)
        	at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.executeJob(SchedulerQueueWorker.java:66)
        	at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.executeNextJob(SchedulerQueueWorker.java:60)
        	at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.run(SchedulerQueueWorker.java:35)
        	at java.lang.Thread.run(Thread.java:748)
        Caused by: org.ofbiz.core.entity.GenericDataSourceException: SQL Exception while executing the following:SELECT ID, JOB_ID, JOB_RUNNER_KEY, SCHED_TYPE, INTERVAL_MILLIS, FIRST_RUN, CRON_EXPRESSION, TIME_ZONE, NEXT_RUN, VERSION, PARAMETERS FROM dbo.clusteredjob WHERE JOB_ID=? (Connection reset)
        	at org.ofbiz.core.entity.jdbc.SQLProcessor.executeQuery(SQLProcessor.java:533)
        	at org.ofbiz.core.entity.GenericDAO.createEntityListIterator(GenericDAO.java:877)
        	at org.ofbiz.core.entity.GenericDAO.selectListIteratorByCondition(GenericDAO.java:857)
        	at org.ofbiz.core.entity.GenericHelperDAO.findListIteratorByCondition(GenericHelperDAO.java:216)
        	at org.ofbiz.core.entity.GenericDelegator.findListIteratorByCondition(GenericDelegator.java:1243)
        	... 12 more
        Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: Connection reset
        	at com.microsoft.sqlserver.jdbc.SQLServerConnection.terminate(SQLServerConnection.java:2887)
        	at com.microsoft.sqlserver.jdbc.TDSChannel.write(IOBuffer.java:2045)
        	at com.microsoft.sqlserver.jdbc.TDSWriter.flush(IOBuffer.java:4146)
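
To find out which jobs may have been affected, the log can be searched for the two message patterns shown in the examples above. The following is a minimal sketch, not an official tool: the function name `find_failed_jobs` is hypothetical, and the regular expressions are derived only from the three sample log lines in this report, so other Jira versions may word these messages differently. A match marks a job as a candidate for this bug; it does not prove the job was lost.

```python
import re
from datetime import datetime, timezone

# Pattern of the SchedulerQueueWorker message seen in Examples 1 and 2.
QUEUED_JOB = re.compile(
    r"Unhandled exception thrown by job "
    r"QueuedJob\[jobId=(?P<job_id>.+?),deadline=(?P<deadline>\d+)\]"
)
# Pattern of the CaesiumSchedulerService message seen in Example 3.
RECOVERY = re.compile(
    r"Unhandled exception during the attempt to execute job '(?P<job_id>[^']+)'"
)


def find_failed_jobs(lines):
    """Return (job_id, deadline) pairs in order of appearance.

    `deadline` is a UTC datetime parsed from the epoch-millisecond value in
    the log line, or None when the message does not carry a deadline.
    """
    failures = []
    for line in lines:
        match = QUEUED_JOB.search(line)
        if match:
            millis = int(match.group("deadline"))
            deadline = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
            failures.append((match.group("job_id"), deadline))
            continue
        match = RECOVERY.search(line)
        if match:
            failures.append((match.group("job_id"), None))
    return failures
```

A typical use would be to pass an open log file to the function, for example `find_failed_jobs(open(path, encoding="utf-8", errors="replace"))`, where `path` points to the node's Jira application log. On a multi-node cluster each node's log would need to be checked separately, since the locally executed jobs fail per node.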
        

      Expected Results

      Jira will run the job according to its schedule as soon as database connectivity is restored.
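
One way to picture the expected behaviour is a worker loop that always puts a job back on the queue, even when the database lookup for its next run time fails. This is only a sketch under assumptions and not a proposed patch for Caesium: `run_due_jobs`, `next_run_time` and the `db_up` flag are hypothetical names, and the 60-second retry delay simply mirrors the "will attempt recovery in 60 seconds" message from Example 3.

```python
import heapq

RETRY_DELAY = 60  # seconds; hypothetical value chosen for this sketch


def run_due_jobs(queue, now, next_run_time, retry_delay=RETRY_DELAY):
    """Process every due (run_at, job_id) entry and always re-queue the job."""
    ran = []
    while queue and queue[0][0] <= now:
        run_at, job_id = heapq.heappop(queue)
        try:
            following = next_run_time(job_id, run_at)  # may hit the database
            ran.append(job_id)
        except ConnectionError:
            # Keep the job alive and try again once the delay has passed.
            following = now + retry_delay
        heapq.heappush(queue, (following, job_id))
    return ran


db_up = {"value": True}  # hypothetical switch simulating database availability


def next_run_time(job_id, run_at):
    if not db_up["value"]:
        raise ConnectionError("database unavailable")
    return run_at + 60


queue = [(0, "mail-queue")]
first = run_due_jobs(queue, 0, next_run_time)     # database up: job runs
db_up["value"] = False
second = run_due_jobs(queue, 60, next_run_time)   # database down: job is kept
db_up["value"] = True
third = run_due_jobs(queue, 120, next_run_time)   # database back: job runs again
print(first, second, third, queue)
```

The design choice here is that the re-queue happens outside the `try` block, so no exception raised while computing the next run time can remove the job from the queue.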

      Workaround

      • For Jira Server / Jira Data Center (JDC) single node
        • Restart the Jira application
      • For Jira Data Center (JDC) multi node
        • Restart the node impacted by the bug

            People

              Unassigned
              ohernandez@atlassian.com Oswaldo Hernandez (Inactive)
              Votes: 29
              Watchers: 67
