Bamboo HA node changes to an unknown state after DB connection failure

XMLWordPrintable

    • Type: Bug
    • Resolution: Fixed
    • Priority: Low
    • 8.2.0, 8.1.4
    • Affects Version/s: 8.0.0, 8.0.1, 8.0.2, 8.0.3, 8.1.1, 8.0.4, 8.0.5, 8.0.6
    • Component/s: Bamboo Release
    • None
    • 1
    • Severity 2 - Major

      Problem

      A Bamboo node in HA does not fully exit when a DB connection is lost. Once the standby node loses connection to the database, the Java process doesn't exit fully, leaving the application in an unknown, unrecoverable state, not prone to any operating system recovery as the PID is still up.

      If a partial network partition happens, the standby node may be put in an unknown state and may not be able to take over any activities when required.

      Environment

      Bamboo 8+ (HA)

      Steps to Reproduce

      1. Set up 2-node Bamboo HA cluster
      2. Start the cluster normally
      3. Drop the DB connection on the standby node (use a firewall, block on the DB, etc)
      4. Observe the standby node's ClusterLockBootstrapService process trying to reconnect to the DB a couple of times. After that, it dies and leaves the webserver running. The main PID is still valid and the application is still considered as "running" - so no watchdogs will have ways to initiate any recovery action
      5. The non-active (standby) node won't recover if the DB connection is restored

      Expected Results

      The Bamboo standby node should either:

      • Try to connect to the DB for as long as possible during startup. If already connected but received a reset, try reconnecting for only a specific amount of time then die

      or

      • Exit the main Java process immediately after a DB connection fails, so any OS service watchdogs could have decided to whether restart the application or report it

      or

      • A combination of the two options above

      Actual Results

      The ClusterLockBootstrapService process dies after 3 x 30000ms checks and leaves the webserver running without supervision.

      2022-01-18 04:04:23,398 FATAL [localhost-startStop-1] [DefaultBootstrapManager] Failed to acquire cluster lock
      java.lang.RuntimeException: java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30000ms.
              at com.google.common.base.Throwables.propagate(Throwables.java:241)
              at com.atlassian.bamboo.beehive.AbstractBambooRawJdbcDao.getBootstrapDatabaseConnection(AbstractBambooRawJdbcDao.java:116)
              at com.atlassian.bamboo.beehive.AbstractBambooRawJdbcDao.supplierWithConnection(AbstractBambooRawJdbcDao.java:71)
              at com.atlassian.bamboo.beehive.BambooClusterLockRawJdbcDao.tryUpdateAcquireNodeScopedLock(BambooClusterLockRawJdbcDao.java:103)
              at com.atlassian.bamboo.beehive.ClusterLockBootstrapServiceImpl.acquirePrimaryNodeClusterLock(ClusterLockBootstrapServiceImpl.java:65)
              at com.atlassian.bamboo.setup.DefaultBootstrapManager.performPersistenceUpgrade(DefaultBootstrapManager.java:378)
              at com.atlassian.config.bootstrap.DefaultAtlassianBootstrapManager.init(DefaultAtlassianBootstrapManager.java:81)
              at com.atlassian.bamboo.setup.BootstrapLoaderListener.contextInitialized(BootstrapLoaderListener.java:155)
              at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4763)
              at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5232)
              at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:183)
              at org.apache.catalina.core.ContainerBase$StartChild.call(ContainerBase.java:1427)
              at org.apache.catalina.core.ContainerBase$StartChild.call(ContainerBase.java:1417)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
      Caused by: java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30000ms.
              at com.zaxxer.hikari.pool.HikariPool.createTimeoutException(HikariPool.java:676)
              at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:190)
              at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:155)
              at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:100)
              at org.hibernate.hikaricp.internal.HikariCPConnectionProvider.getConnection(HikariCPConnectionProvider.java:77)
              at com.atlassian.bamboo.upgrade.BootstrapTaskConnectionProvider.getConnection(BootstrapTaskConnectionProvider.java:82)
              at com.atlassian.bamboo.beehive.AbstractBambooRawJdbcDao.getBootstrapDatabaseConnection(AbstractBambooRawJdbcDao.java:113)
              ... 15 more
      Caused by: java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection
              at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:774)
              at oracle.jdbc.driver.PhysicalConnection.connect(PhysicalConnection.java:688)
              at oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:39)
              at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:691)
              at com.zaxxer.hikari.util.DriverDataSource.getConnection(DriverDataSource.java:136)
              at com.zaxxer.hikari.pool.PoolBase.newConnection(PoolBase.java:369)
              at com.zaxxer.hikari.pool.PoolBase.newPoolEntry(PoolBase.java:198)
              at com.zaxxer.hikari.pool.HikariPool.createPoolEntry(HikariPool.java:467)
              at com.zaxxer.hikari.pool.HikariPool.access$100(HikariPool.java:71)
              at com.zaxxer.hikari.pool.HikariPool$PoolEntryCreator.call(HikariPool.java:706)
              at com.zaxxer.hikari.pool.HikariPool$PoolEntryCreator.call(HikariPool.java:692)
              ... 4 more
      Caused by: oracle.net.ns.NetException: The Network Adapter could not establish the connection
              at oracle.net.nt.ConnStrategy.execute(ConnStrategy.java:523)
              at oracle.net.resolver.AddrResolution.resolveAndExecute(AddrResolution.java:521)
              at oracle.net.ns.NSProtocol.establishConnection(NSProtocol.java:660)
              at oracle.net.ns.NSProtocol.connect(NSProtocol.java:286)
              at oracle.jdbc.driver.T4CConnection.connect(T4CConnection.java:1438)
              at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:518)
              ... 14 more
      Caused by: java.io.IOException: Socket read timed out, socket connect lapse 3 ms. db.internal/192.168.3.96 1521 3 3 true
              at oracle.net.nt.TcpNTAdapter.connect(TcpNTAdapter.java:209)
              at oracle.net.nt.ConnOption.connect(ConnOption.java:161)
              at oracle.net.nt.ConnStrategy.execute(ConnStrategy.java:470)
              ... 19 more
      Caused by: java.io.InterruptedIOException: Socket read timed out
              at oracle.net.nt.TimeoutSocketChannel.handleInterrupt(TimeoutSocketChannel.java:307)
              at oracle.net.nt.TimeoutSocketChannel.<init>(TimeoutSocketChannel.java:84)
              at oracle.net.nt.TcpNTAdapter.connect(TcpNTAdapter.java:169)
              ... 21 more
      2022-01-18 04:04:23,402 INFO [localhost-startStop-1] [lifecycle] Bamboo home directory: /appl/bamboo_data
      2022-01-18 04:04:23,402 INFO [localhost-startStop-1] [lifecycle] Default charset: UTF-8, file name ecoding: utf-8
      2022-01-18 04:04:23,415 INFO [localhost-startStop-1] [HikariDataSource] HikariPool-1 - Shutdown initiated...
      2022-01-18 04:04:27,924 INFO [localhost-startStop-1] [HikariDataSource] HikariPool-1 - Shutdown completed.
      2022-01-18 04:04:27,926 INFO [localhost-startStop-1] [UpgradeLauncher] Upgrades not performed since the application has not been set up yet.
      2022-01-18 04:04:27,964 INFO [localhost-startStop-1] [CompressingFilter] CompressingFilter has initialized
      2022-01-18 04:04:28,672 INFO [localhost-startStop-1] [SpringObjectFactoryProxy] Deferring initialisation of com.atlassian.bamboo.plugin.xwork.SpringObjectFactoryProxy
      2022-01-18 04:04:29,665 INFO [localhost-startStop-1] [Dispatchers] Dispatcher org.apache.struts2.dispatcher.Dispatcher initialised
      2022-01-18 04:04:29,756 INFO [localhost-startStop-1] [SpringObjectFactoryProxy] Deferring initialisation of com.atlassian.bamboo.plugin.xwork.SpringObjectFactoryProxy
      2022-01-18 04:04:30,456 INFO [localhost-startStop-1] [SpringObjectFactoryProxy] Deferring initialisation of com.atlassian.bamboo.plugin.xwork.SpringObjectFactoryProxy
      2022-01-18 04:04:30,562 INFO [localhost-startStop-1] [Dispatchers] Dispatcher org.apache.struts2.dispatcher.Dispatcher initialised
      2022-01-18 04:04:30,602 INFO [localhost-startStop-1] [SpringObjectFactoryProxy] Deferring initialisation of com.atlassian.bamboo.plugin.xwork.SpringObjectFactoryProxy
      2022-01-18 04:04:31,080 INFO [http-nio-127.0.0.1-8085-exec-53] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/plugins/servlet' to '/error/errorEvents.action'
      2022-01-18 04:04:31,080 INFO [http-nio-127.0.0.1-8085-exec-15] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/plugins/servlet' to '/error/errorEvents.action'
      2022-01-18 04:04:31,080 INFO [http-nio-127.0.0.1-8085-exec-9] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/rest/api/latest/status' to '/error/errorEvents.action'
      2022-01-18 04:04:31,080 INFO [http-nio-127.0.0.1-8085-exec-61] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/plugins/servlet' to '/error/errorEvents.action'
      2022-01-18 04:04:31,080 INFO [http-nio-127.0.0.1-8085-exec-75] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/plugins/servlet' to '/error/errorEvents.action'
      2022-01-18 04:04:31,081 INFO [http-nio-127.0.0.1-8085-exec-56] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/rest/api/latest/status' to '/error/errorEvents.action'
      2022-01-18 04:04:31,081 INFO [http-nio-127.0.0.1-8085-exec-95] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/plugins/servlet' to '/error/errorEvents.action'
      2022-01-18 04:04:31,080 INFO [http-nio-127.0.0.1-8085-exec-24] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/plugins/servlet' to '/error/errorEvents.action'
      2022-01-18 04:04:31,081 INFO [http-nio-127.0.0.1-8085-exec-92] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/rest/api/latest/status' to '/error/errorEvents.action'
      2022-01-18 04:04:31,080 INFO [http-nio-127.0.0.1-8085-exec-44] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/rest/api/latest/status' to '/error/errorEvents.action'
      2022-01-18 04:04:31,081 INFO [http-nio-127.0.0.1-8085-exec-78] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/plugins/servlet' to '/error/errorEvents.action'
      2022-01-18 04:04:31,082 INFO [http-nio-127.0.0.1-8085-exec-41] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/plugins/servlet' to '/error/errorEvents.action'
      

      Workaround

      In the Operating system, implement a log watch rule that would match [DefaultBootstrapManager] Failed to acquire cluster lock in <bamboo-home>/logs/atlassian-bamboo.log and trigger a full restart of Bamboo.

      Notes

            Assignee:
            Jan Majkutewicz (Inactive)
            Reporter:
            Eduardo Alvarenga (Inactive)
            Votes:
            9 Vote for this issue
            Watchers:
            9 Start watching this issue

              Created:
              Updated:
              Resolved: