Uploaded image for project: 'Jira Data Center'
  1. Jira Data Center
  2. JRASERVER-66597

JIRA DC might lose Cluster lock due database connectivity problems

    • 7.02
    • 53
    • Severity 2 - Major
    • 149
    • Hide
      Atlassian Update - 17th June 2019

      Dear Jira users,

      we're glad to announce that this issue will be addressed in our upcoming 8.3.0 release!

      Cluster locks have been turned into leases that automatically expire when no longer used, making the situation described in this issue automatically recoverable.

      Thank you,
      Grazyna Kaszkur

      Product Manager,
      Jira Server and Data Center

      Show
      Atlassian Update - 17th June 2019 Dear Jira users, we're glad to announce that this issue will be addressed in our upcoming 8.3.0 release! Cluster locks have been turned into leases that automatically expire when no longer used, making the situation described in this issue automatically recoverable. Thank you, Grazyna Kaszkur
 Product Manager, Jira Server and Data Center

      Summary

      JIRA DC might lose cluster lock while trying to unlock the lock if there is database connectivity problems .

      That makes whole cluster stuck for some operations/actions that needs that lock.
      Specifically we saw problem with:

      • com.atlassian.jira.config.DefaultReindexMessageManager.messageLock lock
      • com.atlassian.upm.impl.PluginManagerPluginAsynchronousTaskStatusStoreImpl lock
      • NotificationBatchScheduler.sd_batch_sending_lock lock
      • sla_issue_update_xx lock
      • the Service Desk Mail Puller job com.atlassian.jira.internal.mail.services.MailPullerJobRunner

      So there is a possible chain of events:

      1. Node1 obtains the lock in thread1
        • Node1 registers it in DB and internal JVM structure.
      2. All other nodes see lock and wait
        • Thread1 unlocks the lock. While updating db it fails and release only local ref lock
      3. All other nodes see lock and wait
      4. Lock check at Node1 see that local ref is NULL and DB lock is present, so does nothing
      5. All stuck

      Environment

      • JIRA Datacenter

      Steps to Reproduce

      1. Set up a Datacenter JIRA instance with 2 nodes
      2. Initiate action that would obtain the lock
      3. Break connection to the database while this node has a clusterlock and tries to unlock it

      Expected Results

      JIRA will recover and do the unlock query once connected back to the database

      Actual Results

      JIRA DC lost the cluster lock. The below exception is thrown in the atlassian-jira.log file:

      2018-01-19 09:09:31,377 http-nio-8080-exec-1 ERROR bob 300x234323428x11 18tz5ra 10.0.0.1 /secure/admin/user/UserBrowser.jspa [o.a.c.c.C.[Catalina].[localhost].[/jira]] Unhandled exception occurred whilst decorating page
      java.lang.RuntimeException: org.apache.jasper.JasperException: com.atlassian.jira.exception.DataAccessException: org.ofbiz.core.entity.GenericDataSourceException: SQL Exception while executing the following:UPDATE clusterlockstatus SET  UPDATE_TIME = ? ,  LOCKED_BY_NODE = ?  WHERE LOCK_NAME=? AND LOCKED_BY_NODE=? (Communications link failure
      
      The last packet successfully received from the server was 1,339,594 milliseconds ago.  The last packet sent successfully to the server was 0 milliseconds ago.)
      	at com.opensymphony.sitemesh.webapp.decorator.BaseWebAppDecorator.render(BaseWebAppDecorator.java:37)
      	at com.opensymphony.sitemesh.webapp.SiteMeshFilter.doFilter(SiteMeshFilter.java:92)
      	at com.atlassian.jira.web.filters.SitemeshPageFilter.doFilter(SitemeshPageFilter.java:112)
      	... 10 filtered
      	at com.almworks.jira.structure.util.servlet.AbstractStructureServletFilter.doHttpFilter(AbstractStructureServletFilter.java:46)
      	at com.almworks.jira.structure.util.servlet.AbstractStructureServletFilter.doFilter(AbstractStructureServletFilter.java:34)
      	... 3 filtered
      	at com.almworks.jira.structure.util.servlet.AbstractStructureServletFilter.doHttpFilter(AbstractStructureServletFilter.java:46)
      	at com.almworks.jira.structure.util.servlet.AbstractStructureServletFilter.doFilter(AbstractStructureServletFilter.java:34)
      	... 3 filtered
      	at com.almworks.jira.structure.util.servlet.AbstractStructureServletFilter.doHttpFilter(AbstractStructureServletFilter.java:46)
      	at com.almworks.jira.structure.util.servlet.AbstractStructureServletFilter.doFilter(AbstractStructureServletFilter.java:34)
      	... 18 filtered
      	at com.atlassian.jira.security.JiraSecurityFilter.lambda$doFilter$0(JiraSecurityFilter.java:76)
      	... 1 filtered
      	at com.atlassian.jira.security.JiraSecurityFilter.doFilter(JiraSecurityFilter.java:74)
      	... 40 filtered
      	at com.atlassian.jira.servermetrics.CorrelationIdPopulatorFilter.doFilter(CorrelationIdPopulatorFilter.java:30)
      	... 5 filtered
      	at com.atlassian.servicedesk.internal.web.CustomerContextSettingFilter.lambda$invokeFilterChain$0(CustomerContextSettingFilter.java:181)
      	at com.atlassian.servicedesk.internal.utils.context.ReentrantThreadLocalBasedCodeContext.rteInvoke(ReentrantThreadLocalBasedCodeContext.java:142)
      	at com.atlassian.servicedesk.internal.utils.context.ReentrantThreadLocalBasedCodeContext.runOutOfContext(ReentrantThreadLocalBasedCodeContext.java:94)
      	at com.atlassian.servicedesk.internal.utils.context.CustomerContextServiceImpl.runOutOfCustomerContext(CustomerContextServiceImpl.java:64)
      	at com.atlassian.servicedesk.internal.web.CustomerContextSettingFilter.outOfCustomerContext(CustomerContextSettingFilter.java:174)
      	at com.atlassian.servicedesk.internal.web.CustomerContextSettingFilter.doFilterImpl(CustomerContextSettingFilter.java:130)
      	at com.atlassian.servicedesk.internal.web.CustomerContextSettingFilter.doFilter(CustomerContextSettingFilter.java:121)
      	... 9 filtered
      	at com.atlassian.jwt.internal.servlet.JwtAuthFilter.doFilter(JwtAuthFilter.java:32)
      	... 3 filtered
      	at com.atlassian.web.servlet.plugin.request.RedirectInterceptingFilter.doFilter(RedirectInterceptingFilter.java:21)
      	... 4 filtered
      	at com.atlassian.web.servlet.plugin.LocationCleanerFilter.doFilter(LocationCleanerFilter.java:36)
      	... 29 filtered
      	at com.atlassian.jira.servermetrics.MetricsCollectorFilter.doFilter(MetricsCollectorFilter.java:25)
      	... 27 filtered
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
      	at java.lang.Thread.run(Thread.java:745)
      ...
      Caused by: com.atlassian.jira.exception.DataAccessException: org.ofbiz.core.entity.GenericDataSourceException: SQL Exception while executing the following:UPDATE clusterlockstatus SET  UPDATE_TIME = ? ,  LOCKED_BY_NODE = ?  WHERE LOCK_NAME=? AND LOCKED_BY_NODE=? 
      (Communications link failure: 
      The last packet successfully received from the server was 1,339,594 milliseconds ago.  The last packet sent successfully to the server was 0 milliseconds ago.)
      	at com.atlassian.jira.ofbiz.DefaultOfBizDelegator.bulkUpdateByAnd(DefaultOfBizDelegator.java:570)
      	at com.atlassian.jira.ofbiz.WrappingOfBizDelegator.bulkUpdateByAnd(WrappingOfBizDelegator.java:290)
      	at com.atlassian.jira.entity.Update$WhereContext.execute(Update.java:126)
      	at com.atlassian.jira.entity.Update$WhereContext.execute(Update.java:122)
      	at com.atlassian.jira.cluster.lock.JiraClusterLockDao.unlock(JiraClusterLockDao.java:82)
      	at com.atlassian.beehive.db.DatabaseClusterLock.unlock(DatabaseClusterLock.java:372)
      	at com.atlassian.jira.config.DefaultReindexMessageManager.getMessage(DefaultReindexMessageManager.java:190)
      	at org.apache.jsp.decorators.admin_jsp._jspService(admin_jsp.java:485)
      	at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
      	at javax.servlet.http.HttpServlet.service(HttpServlet.java:729)
      	at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:438)
      	... 208 more
      Caused by: org.ofbiz.core.entity.GenericDataSourceException: SQL Exception while executing the following:UPDATE clusterlockstatus SET  UPDATE_TIME = ? ,  LOCKED_BY_NODE = ?  WHERE LOCK_NAME=? AND LOCKED_BY_NODE=? (Communications link failure
      
      The last packet successfully received from the server was 1,339,594 milliseconds ago.  The last packet sent successfully to the server was 0 milliseconds ago.)
      	at org.ofbiz.core.entity.jdbc.SQLProcessor.executeUpdate(SQLProcessor.java:683)
      	at com.atlassian.jira.ofbiz.DefaultOfBizDelegator.bulkUpdateByAnd(DefaultOfBizDelegator.java:565)
      	... 219 more
      The last packet successfully received from the server was 1,339,594 milliseconds ago.  The last packet sent successfully to the server was 0 milliseconds ago.
      	... 2 filtered
      	at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
      	at com.mysql.jdbc.Util.handleNewInstance(Util.java:425)
      	at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:989)
      	at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3559)
      	at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3459)
      	at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3900)
      	at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2527)
      	at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2680)
      	at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2490)
      	at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1858)
      	at com.mysql.jdbc.PreparedStatement.executeUpdateInternal(PreparedStatement.java:2079)
      	at com.mysql.jdbc.PreparedStatement.executeUpdateInternal(PreparedStatement.java:2013)
      	at com.mysql.jdbc.PreparedStatement.executeLargeUpdate(PreparedStatement.java:5104)
      	at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1998)
      	at org.apache.commons.dbcp2.DelegatingPreparedStatement.executeUpdate(DelegatingPreparedStatement.java:98)
      	at org.apache.commons.dbcp2.DelegatingPreparedStatement.executeUpdate(DelegatingPreparedStatement.java:98)
      	at org.ofbiz.core.entity.jdbc.SQLProcessor.executeUpdate(SQLProcessor.java:673)
      	... 220 more
      Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.
      	at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:3011)
      	at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3469)
      	... 233 more
      

      Key part:

      Caused by: com.atlassian.jira.exception.DataAccessException: org.ofbiz.core.entity.GenericDataSourceException: SQL Exception while executing the following:UPDATE clusterlockstatus SET  UPDATE_TIME = ? ,  LOCKED_BY_NODE = ?  WHERE LOCK_NAME=? AND LOCKED_BY_NODE=? 
      (Communications link failure: 
      The last packet successfully received from the server was 1,339,594 milliseconds ago.  The last packet sent successfully to the server was 0 milliseconds ago.)
      	at com.atlassian.jira.ofbiz.DefaultOfBizDelegator.bulkUpdateByAnd(DefaultOfBizDelegator.java:570)
      	at com.atlassian.jira.ofbiz.WrappingOfBizDelegator.bulkUpdateByAnd(WrappingOfBizDelegator.java:290)
      	at com.atlassian.jira.entity.Update$WhereContext.execute(Update.java:126)
      	at com.atlassian.jira.entity.Update$WhereContext.execute(Update.java:122)
      	at com.atlassian.jira.cluster.lock.JiraClusterLockDao.unlock(JiraClusterLockDao.java:82)
      	at com.atlassian.beehive.db.DatabaseClusterLock.unlock(DatabaseClusterLock.java:372)
      	at com.atlassian.jira.config.DefaultReindexMessageManager.getMessage(DefaultReindexMessageManager.java:190)
      ...
      

       

      Workaround

      Restart the problematic node. See the following KB article for full instructions on this workaround:
      https://confluence.atlassian.com/jirakb/jira-data-center-functionalities-loss-due-to-cluster-wide-lock-942860754.html

          Form Name

            [JRASERVER-66597] JIRA DC might lose Cluster lock due database connectivity problems

            Hi,

            Thank you for the recent comments.

            We want to inform you that we have decided not to backport this fix to the 7.13 Enterprise Release version. While we’ve been working to backport this fix in recent months, after careful analysis, we ultimately feel that the change would bring too much risk to the stability of the release.

            We have made this tough call due to the fact that the changes introduced in 8.3 were complex, scattered over many libraries and codebases. Additionally, backporting this particular change over the platform release involved many moving parts. All this made us realise that the risk we might bring with this change is far too severe and we might cause even more problems if we decide to go forward with this backport.

            We are aware that you were waiting for this update and our decision may be disappointing to you. We would like to reassure you that we will keep backporting crucial bugs to our long term support versions as we have stated here: https://confluence.atlassian.com/enterprise/atlassian-enterprise-releases-948227420.html, simply in some cases this brings more risks than benefits that we hadn’t originally anticipated.

            If you are impacted by this bug, we recommend upgrading to our newest Enterprise release, version 8.5.x. Please follow our upgrade guide for Jira 8.5: https://confluence.atlassian.com/jiracore/jira-8-5-enterprise-release-upgrade-guide-976161392.html.

            Best,

            Jira Server and Data Center product management

            Grazyna Kaszkur added a comment - Hi, Thank you for the recent comments. We want to inform you that we have decided not to backport this fix to the 7.13 Enterprise Release version. While we’ve been working to backport this fix in recent months, after careful analysis, we ultimately feel that the change would bring too much risk to the stability of the release. We have made this tough call due to the fact that the changes introduced in 8.3 were complex, scattered over many libraries and codebases. Additionally, backporting this particular change over the platform release involved many moving parts. All this made us realise that the risk we might bring with this change is far too severe and we might cause even more problems if we decide to go forward with this backport. We are aware that you were waiting for this update and our decision may be disappointing to you. We would like to reassure you that we will keep backporting crucial bugs to our long term support versions as we have stated here: https://confluence.atlassian.com/enterprise/atlassian-enterprise-releases-948227420.html , simply in some cases this brings more risks than benefits that we hadn’t originally anticipated. If you are impacted by this bug, we recommend upgrading to our newest Enterprise release, version 8.5.x. Please follow our upgrade guide for Jira 8.5: https://confluence.atlassian.com/jiracore/jira-8-5-enterprise-release-upgrade-guide-976161392.html . Best, Jira Server and Data Center product management

            Is there any update regarding this issue?

            Yaakov Shami added a comment - Is there any update regarding this issue?

            There will be correction for version 7.13.x ER?

            Vinicius Almeida [E-core] added a comment - There will be correction for version 7.13.x ER?

            Hey gajan.umapathy2004152137,
            I wanted to confirm that we are aiming to backport this fix to 7.13 Enterprise Release, however as ayakovlev@atlassian.com mentioned we do not have a specific 7.13 ER version yet. This is due to the fact that we have decided to extend the soaking period for this fix as a stability precaution before the backport.
            Please expect another update from our end in October 2019. Thank you.

            Grazyna Kaszkur added a comment - Hey gajan.umapathy2004152137 , I wanted to confirm that we are aiming to backport this fix to 7.13 Enterprise Release, however as  ayakovlev@atlassian.com  mentioned we do not have a specific 7.13 ER version yet. This is due to the fact that we have decided to extend the soaking period for this fix as a stability precaution before the backport. Please expect another update from our end in October 2019. Thank you.

            Any update on the 7.13.x ER fix for this bug?

            Gaj Umapathy added a comment - Any update on the 7.13.x ER fix for this bug?

            Thanks for the update Andriy. Kindly keep us posted when Atlassian decides to release the new 7.13.x version with this fix.

            Saurabh Sharma (Inactive) added a comment - Thanks for the update Andriy. Kindly keep us posted when Atlassian decides to release the new 7.13.x version with this fix.

            Hey ssharma
            We don't have a specific 7.13 ER version yet.
            To answer your specific question, 7.13.5 doesn't have the fix.

            Andriy Yakovlev [Atlassian] added a comment - Hey ssharma We don't have a specific 7.13 ER version yet. To answer your specific question, 7.13.5 doesn't have the fix.

            Hi Team,

            Can you please clarify this bug is getting fixed in which version of 7.13 ER. We are going to upgrade to 7.13.5 by end of July, is this fix available in this version of JIRA ?

            Saurabh Sharma (Inactive) added a comment - Hi Team, Can you please clarify this bug is getting fixed in which version of 7.13 ER. We are going to upgrade to 7.13.5 by end of July, is this fix available in this version of JIRA ?

            Hi , is there any update on the progress to fix this issue ?

            Dashrath L Kale added a comment - Hi , is there any update on the progress to fix this issue ?

            Hi everyone,

            We have recently reviewed this issue and the overall interest in the problem. We acknowledge that the issue has a high impact on the software teams and agree that it has to be prioritised in the Short Term which corresponds the current status of the bug.

            We would aim to address the issue within the next 3 months or so, yet please do not consider this a deadline to fix the bug, but rather as an estimated timeline. Please refer to Atlassian Bugfix Policy for the details on how we prioritise and deliver fixes in our product.

            Atlassian will continue to watch this issue for further updates, so please feel free to share your thoughts in the comments.

            Thank you,
            Grażyna Kaszkur
            Jira Server Product Management

            Grazyna Kaszkur added a comment - Hi everyone, We have recently reviewed this issue and the overall interest in the problem. We acknowledge that the issue has a high impact on the software teams and agree that it has to be prioritised in the Short Term which corresponds the current status of the bug. We would aim to address the issue within the next 3 months or so, yet please do not consider this a deadline to fix the bug, but rather as an estimated timeline. Please refer to Atlassian Bugfix Policy for the details on how we prioritise and deliver fixes in our product. Atlassian will continue to watch this issue for further updates, so please feel free to share your thoughts in the comments. Thank you, Grażyna Kaszkur Jira Server Product Management

              klopacinski Karol Lopacinski
              ayakovlev@atlassian.com Andriy Yakovlev [Atlassian]
              Affected customers:
              27 This affects my team
              Watchers:
              55 Start watching this issue

                Created:
                Updated:
                Resolved: