Uploaded image for project: 'Jira Data Center'
  1. Jira Data Center
  2. JRASERVER-66597

JIRA DC might lose Cluster lock due database connectivity problems

    XMLWordPrintable

Details

    • 7.02
    • 53
    • Severity 2 - Major
    • 149
    • Hide
      Atlassian Update - 17th June 2019

      Dear Jira users,

      we're glad to announce that this issue will be addressed in our upcoming 8.3.0 release!

      Cluster locks have been turned into leases that automatically expire when no longer used, making the situation described in this issue automatically recoverable.

      Thank you,
      Grazyna Kaszkur

      Product Manager,
      Jira Server and Data Center

      Show
      Atlassian Update - 17th June 2019 Dear Jira users, we're glad to announce that this issue will be addressed in our upcoming 8.3.0 release! Cluster locks have been turned into leases that automatically expire when no longer used, making the situation described in this issue automatically recoverable. Thank you, Grazyna Kaszkur
 Product Manager, Jira Server and Data Center

    Description

      Summary

      JIRA DC might lose cluster lock while trying to unlock the lock if there is database connectivity problems .

      That makes whole cluster stuck for some operations/actions that needs that lock.
      Specifically we saw problem with:

      • com.atlassian.jira.config.DefaultReindexMessageManager.messageLock lock
      • com.atlassian.upm.impl.PluginManagerPluginAsynchronousTaskStatusStoreImpl lock
      • NotificationBatchScheduler.sd_batch_sending_lock lock
      • sla_issue_update_xx lock
      • the Service Desk Mail Puller job com.atlassian.jira.internal.mail.services.MailPullerJobRunner

      So there is a possible chain of events:

      1. Node1 obtains the lock in thread1
        • Node1 registers it in DB and internal JVM structure.
      2. All other nodes see lock and wait
        • Thread1 unlocks the lock. While updating db it fails and release only local ref lock
      3. All other nodes see lock and wait
      4. Lock check at Node1 see that local ref is NULL and DB lock is present, so does nothing
      5. All stuck

      Environment

      • JIRA Datacenter

      Steps to Reproduce

      1. Set up a Datacenter JIRA instance with 2 nodes
      2. Initiate action that would obtain the lock
      3. Break connection to the database while this node has a clusterlock and tries to unlock it

      Expected Results

      JIRA will recover and do the unlock query once connected back to the database

      Actual Results

      JIRA DC lost the cluster lock. The below exception is thrown in the atlassian-jira.log file:

      2018-01-19 09:09:31,377 http-nio-8080-exec-1 ERROR bob 300x234323428x11 18tz5ra 10.0.0.1 /secure/admin/user/UserBrowser.jspa [o.a.c.c.C.[Catalina].[localhost].[/jira]] Unhandled exception occurred whilst decorating page
      java.lang.RuntimeException: org.apache.jasper.JasperException: com.atlassian.jira.exception.DataAccessException: org.ofbiz.core.entity.GenericDataSourceException: SQL Exception while executing the following:UPDATE clusterlockstatus SET  UPDATE_TIME = ? ,  LOCKED_BY_NODE = ?  WHERE LOCK_NAME=? AND LOCKED_BY_NODE=? (Communications link failure
      
      The last packet successfully received from the server was 1,339,594 milliseconds ago.  The last packet sent successfully to the server was 0 milliseconds ago.)
      	at com.opensymphony.sitemesh.webapp.decorator.BaseWebAppDecorator.render(BaseWebAppDecorator.java:37)
      	at com.opensymphony.sitemesh.webapp.SiteMeshFilter.doFilter(SiteMeshFilter.java:92)
      	at com.atlassian.jira.web.filters.SitemeshPageFilter.doFilter(SitemeshPageFilter.java:112)
      	... 10 filtered
      	at com.almworks.jira.structure.util.servlet.AbstractStructureServletFilter.doHttpFilter(AbstractStructureServletFilter.java:46)
      	at com.almworks.jira.structure.util.servlet.AbstractStructureServletFilter.doFilter(AbstractStructureServletFilter.java:34)
      	... 3 filtered
      	at com.almworks.jira.structure.util.servlet.AbstractStructureServletFilter.doHttpFilter(AbstractStructureServletFilter.java:46)
      	at com.almworks.jira.structure.util.servlet.AbstractStructureServletFilter.doFilter(AbstractStructureServletFilter.java:34)
      	... 3 filtered
      	at com.almworks.jira.structure.util.servlet.AbstractStructureServletFilter.doHttpFilter(AbstractStructureServletFilter.java:46)
      	at com.almworks.jira.structure.util.servlet.AbstractStructureServletFilter.doFilter(AbstractStructureServletFilter.java:34)
      	... 18 filtered
      	at com.atlassian.jira.security.JiraSecurityFilter.lambda$doFilter$0(JiraSecurityFilter.java:76)
      	... 1 filtered
      	at com.atlassian.jira.security.JiraSecurityFilter.doFilter(JiraSecurityFilter.java:74)
      	... 40 filtered
      	at com.atlassian.jira.servermetrics.CorrelationIdPopulatorFilter.doFilter(CorrelationIdPopulatorFilter.java:30)
      	... 5 filtered
      	at com.atlassian.servicedesk.internal.web.CustomerContextSettingFilter.lambda$invokeFilterChain$0(CustomerContextSettingFilter.java:181)
      	at com.atlassian.servicedesk.internal.utils.context.ReentrantThreadLocalBasedCodeContext.rteInvoke(ReentrantThreadLocalBasedCodeContext.java:142)
      	at com.atlassian.servicedesk.internal.utils.context.ReentrantThreadLocalBasedCodeContext.runOutOfContext(ReentrantThreadLocalBasedCodeContext.java:94)
      	at com.atlassian.servicedesk.internal.utils.context.CustomerContextServiceImpl.runOutOfCustomerContext(CustomerContextServiceImpl.java:64)
      	at com.atlassian.servicedesk.internal.web.CustomerContextSettingFilter.outOfCustomerContext(CustomerContextSettingFilter.java:174)
      	at com.atlassian.servicedesk.internal.web.CustomerContextSettingFilter.doFilterImpl(CustomerContextSettingFilter.java:130)
      	at com.atlassian.servicedesk.internal.web.CustomerContextSettingFilter.doFilter(CustomerContextSettingFilter.java:121)
      	... 9 filtered
      	at com.atlassian.jwt.internal.servlet.JwtAuthFilter.doFilter(JwtAuthFilter.java:32)
      	... 3 filtered
      	at com.atlassian.web.servlet.plugin.request.RedirectInterceptingFilter.doFilter(RedirectInterceptingFilter.java:21)
      	... 4 filtered
      	at com.atlassian.web.servlet.plugin.LocationCleanerFilter.doFilter(LocationCleanerFilter.java:36)
      	... 29 filtered
      	at com.atlassian.jira.servermetrics.MetricsCollectorFilter.doFilter(MetricsCollectorFilter.java:25)
      	... 27 filtered
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
      	at java.lang.Thread.run(Thread.java:745)
      ...
      Caused by: com.atlassian.jira.exception.DataAccessException: org.ofbiz.core.entity.GenericDataSourceException: SQL Exception while executing the following:UPDATE clusterlockstatus SET  UPDATE_TIME = ? ,  LOCKED_BY_NODE = ?  WHERE LOCK_NAME=? AND LOCKED_BY_NODE=? 
      (Communications link failure: 
      The last packet successfully received from the server was 1,339,594 milliseconds ago.  The last packet sent successfully to the server was 0 milliseconds ago.)
      	at com.atlassian.jira.ofbiz.DefaultOfBizDelegator.bulkUpdateByAnd(DefaultOfBizDelegator.java:570)
      	at com.atlassian.jira.ofbiz.WrappingOfBizDelegator.bulkUpdateByAnd(WrappingOfBizDelegator.java:290)
      	at com.atlassian.jira.entity.Update$WhereContext.execute(Update.java:126)
      	at com.atlassian.jira.entity.Update$WhereContext.execute(Update.java:122)
      	at com.atlassian.jira.cluster.lock.JiraClusterLockDao.unlock(JiraClusterLockDao.java:82)
      	at com.atlassian.beehive.db.DatabaseClusterLock.unlock(DatabaseClusterLock.java:372)
      	at com.atlassian.jira.config.DefaultReindexMessageManager.getMessage(DefaultReindexMessageManager.java:190)
      	at org.apache.jsp.decorators.admin_jsp._jspService(admin_jsp.java:485)
      	at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
      	at javax.servlet.http.HttpServlet.service(HttpServlet.java:729)
      	at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:438)
      	... 208 more
      Caused by: org.ofbiz.core.entity.GenericDataSourceException: SQL Exception while executing the following:UPDATE clusterlockstatus SET  UPDATE_TIME = ? ,  LOCKED_BY_NODE = ?  WHERE LOCK_NAME=? AND LOCKED_BY_NODE=? (Communications link failure
      
      The last packet successfully received from the server was 1,339,594 milliseconds ago.  The last packet sent successfully to the server was 0 milliseconds ago.)
      	at org.ofbiz.core.entity.jdbc.SQLProcessor.executeUpdate(SQLProcessor.java:683)
      	at com.atlassian.jira.ofbiz.DefaultOfBizDelegator.bulkUpdateByAnd(DefaultOfBizDelegator.java:565)
      	... 219 more
      The last packet successfully received from the server was 1,339,594 milliseconds ago.  The last packet sent successfully to the server was 0 milliseconds ago.
      	... 2 filtered
      	at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
      	at com.mysql.jdbc.Util.handleNewInstance(Util.java:425)
      	at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:989)
      	at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3559)
      	at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3459)
      	at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3900)
      	at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2527)
      	at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2680)
      	at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2490)
      	at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1858)
      	at com.mysql.jdbc.PreparedStatement.executeUpdateInternal(PreparedStatement.java:2079)
      	at com.mysql.jdbc.PreparedStatement.executeUpdateInternal(PreparedStatement.java:2013)
      	at com.mysql.jdbc.PreparedStatement.executeLargeUpdate(PreparedStatement.java:5104)
      	at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1998)
      	at org.apache.commons.dbcp2.DelegatingPreparedStatement.executeUpdate(DelegatingPreparedStatement.java:98)
      	at org.apache.commons.dbcp2.DelegatingPreparedStatement.executeUpdate(DelegatingPreparedStatement.java:98)
      	at org.ofbiz.core.entity.jdbc.SQLProcessor.executeUpdate(SQLProcessor.java:673)
      	... 220 more
      Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.
      	at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:3011)
      	at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3469)
      	... 233 more
      

      Key part:

      Caused by: com.atlassian.jira.exception.DataAccessException: org.ofbiz.core.entity.GenericDataSourceException: SQL Exception while executing the following:UPDATE clusterlockstatus SET  UPDATE_TIME = ? ,  LOCKED_BY_NODE = ?  WHERE LOCK_NAME=? AND LOCKED_BY_NODE=? 
      (Communications link failure: 
      The last packet successfully received from the server was 1,339,594 milliseconds ago.  The last packet sent successfully to the server was 0 milliseconds ago.)
      	at com.atlassian.jira.ofbiz.DefaultOfBizDelegator.bulkUpdateByAnd(DefaultOfBizDelegator.java:570)
      	at com.atlassian.jira.ofbiz.WrappingOfBizDelegator.bulkUpdateByAnd(WrappingOfBizDelegator.java:290)
      	at com.atlassian.jira.entity.Update$WhereContext.execute(Update.java:126)
      	at com.atlassian.jira.entity.Update$WhereContext.execute(Update.java:122)
      	at com.atlassian.jira.cluster.lock.JiraClusterLockDao.unlock(JiraClusterLockDao.java:82)
      	at com.atlassian.beehive.db.DatabaseClusterLock.unlock(DatabaseClusterLock.java:372)
      	at com.atlassian.jira.config.DefaultReindexMessageManager.getMessage(DefaultReindexMessageManager.java:190)
      ...
      

       

      Workaround

      Restart the problematic node. See the following KB article for full instructions on this workaround:
      https://confluence.atlassian.com/jirakb/jira-data-center-functionalities-loss-due-to-cluster-wide-lock-942860754.html

      Attachments

        Issue Links

          Activity

            People

              klopacinski Karol Lopacinski
              ayakovlev@atlassian.com Andriy Yakovlev [Atlassian]
              Votes:
              27 Vote for this issue
              Watchers:
              55 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: