-
Type:
Bug
-
Resolution: Fixed
-
Priority:
Low
-
Affects Version/s: 8.0.0, 8.0.1, 8.0.2, 8.0.3, 8.1.1, 8.0.4, 8.0.5, 8.0.6
-
Component/s: Bamboo Release
-
None
-
1
-
Severity 2 - Major
Problem
A Bamboo node in HA does not fully exit when a DB connection is lost. Once the standby node loses connection to the database, the Java process doesn't exit fully, leaving the application in an unknown, unrecoverable state, not prone to any operating system recovery as the PID is still up.
If a partial network partition happens, the standby node may be put in an unknown state and may not be able to take over any activities when required.
Environment
Bamboo 8+ (HA)
Steps to Reproduce
- Set up 2-node Bamboo HA cluster
- Start the cluster normally
- Drop the DB connection on the standby node (use a firewall, block on the DB, etc)
- Observe the standby node's ClusterLockBootstrapService process trying to reconnect to the DB a couple of times. After that, it dies and leaves the webserver running. The main PID is still valid and the application is still considered as "running" - so no watchdogs will have ways to initiate any recovery action
- The non-active (standby) node won't recover if the DB connection is restored
Expected Results
The Bamboo standby node should either:
- Try to connect to the DB for as long as possible during startup. If already connected but received a reset, try reconnecting for only a specific amount of time then die
or
- Exit the main Java process immediately after a DB connection fails, so any OS service watchdogs could have decided to whether restart the application or report it
or
- A combination of the two options above
Actual Results
The ClusterLockBootstrapService process dies after 3 x 30000ms checks and leaves the webserver running without supervision.
2022-01-18 04:04:23,398 FATAL [localhost-startStop-1] [DefaultBootstrapManager] Failed to acquire cluster lock
java.lang.RuntimeException: java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30000ms.
at com.google.common.base.Throwables.propagate(Throwables.java:241)
at com.atlassian.bamboo.beehive.AbstractBambooRawJdbcDao.getBootstrapDatabaseConnection(AbstractBambooRawJdbcDao.java:116)
at com.atlassian.bamboo.beehive.AbstractBambooRawJdbcDao.supplierWithConnection(AbstractBambooRawJdbcDao.java:71)
at com.atlassian.bamboo.beehive.BambooClusterLockRawJdbcDao.tryUpdateAcquireNodeScopedLock(BambooClusterLockRawJdbcDao.java:103)
at com.atlassian.bamboo.beehive.ClusterLockBootstrapServiceImpl.acquirePrimaryNodeClusterLock(ClusterLockBootstrapServiceImpl.java:65)
at com.atlassian.bamboo.setup.DefaultBootstrapManager.performPersistenceUpgrade(DefaultBootstrapManager.java:378)
at com.atlassian.config.bootstrap.DefaultAtlassianBootstrapManager.init(DefaultAtlassianBootstrapManager.java:81)
at com.atlassian.bamboo.setup.BootstrapLoaderListener.contextInitialized(BootstrapLoaderListener.java:155)
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4763)
at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5232)
at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:183)
at org.apache.catalina.core.ContainerBase$StartChild.call(ContainerBase.java:1427)
at org.apache.catalina.core.ContainerBase$StartChild.call(ContainerBase.java:1417)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30000ms.
at com.zaxxer.hikari.pool.HikariPool.createTimeoutException(HikariPool.java:676)
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:190)
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:155)
at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:100)
at org.hibernate.hikaricp.internal.HikariCPConnectionProvider.getConnection(HikariCPConnectionProvider.java:77)
at com.atlassian.bamboo.upgrade.BootstrapTaskConnectionProvider.getConnection(BootstrapTaskConnectionProvider.java:82)
at com.atlassian.bamboo.beehive.AbstractBambooRawJdbcDao.getBootstrapDatabaseConnection(AbstractBambooRawJdbcDao.java:113)
... 15 more
Caused by: java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection
at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:774)
at oracle.jdbc.driver.PhysicalConnection.connect(PhysicalConnection.java:688)
at oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:39)
at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:691)
at com.zaxxer.hikari.util.DriverDataSource.getConnection(DriverDataSource.java:136)
at com.zaxxer.hikari.pool.PoolBase.newConnection(PoolBase.java:369)
at com.zaxxer.hikari.pool.PoolBase.newPoolEntry(PoolBase.java:198)
at com.zaxxer.hikari.pool.HikariPool.createPoolEntry(HikariPool.java:467)
at com.zaxxer.hikari.pool.HikariPool.access$100(HikariPool.java:71)
at com.zaxxer.hikari.pool.HikariPool$PoolEntryCreator.call(HikariPool.java:706)
at com.zaxxer.hikari.pool.HikariPool$PoolEntryCreator.call(HikariPool.java:692)
... 4 more
Caused by: oracle.net.ns.NetException: The Network Adapter could not establish the connection
at oracle.net.nt.ConnStrategy.execute(ConnStrategy.java:523)
at oracle.net.resolver.AddrResolution.resolveAndExecute(AddrResolution.java:521)
at oracle.net.ns.NSProtocol.establishConnection(NSProtocol.java:660)
at oracle.net.ns.NSProtocol.connect(NSProtocol.java:286)
at oracle.jdbc.driver.T4CConnection.connect(T4CConnection.java:1438)
at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:518)
... 14 more
Caused by: java.io.IOException: Socket read timed out, socket connect lapse 3 ms. db.internal/192.168.3.96 1521 3 3 true
at oracle.net.nt.TcpNTAdapter.connect(TcpNTAdapter.java:209)
at oracle.net.nt.ConnOption.connect(ConnOption.java:161)
at oracle.net.nt.ConnStrategy.execute(ConnStrategy.java:470)
... 19 more
Caused by: java.io.InterruptedIOException: Socket read timed out
at oracle.net.nt.TimeoutSocketChannel.handleInterrupt(TimeoutSocketChannel.java:307)
at oracle.net.nt.TimeoutSocketChannel.<init>(TimeoutSocketChannel.java:84)
at oracle.net.nt.TcpNTAdapter.connect(TcpNTAdapter.java:169)
... 21 more
2022-01-18 04:04:23,402 INFO [localhost-startStop-1] [lifecycle] Bamboo home directory: /appl/bamboo_data
2022-01-18 04:04:23,402 INFO [localhost-startStop-1] [lifecycle] Default charset: UTF-8, file name ecoding: utf-8
2022-01-18 04:04:23,415 INFO [localhost-startStop-1] [HikariDataSource] HikariPool-1 - Shutdown initiated...
2022-01-18 04:04:27,924 INFO [localhost-startStop-1] [HikariDataSource] HikariPool-1 - Shutdown completed.
2022-01-18 04:04:27,926 INFO [localhost-startStop-1] [UpgradeLauncher] Upgrades not performed since the application has not been set up yet.
2022-01-18 04:04:27,964 INFO [localhost-startStop-1] [CompressingFilter] CompressingFilter has initialized
2022-01-18 04:04:28,672 INFO [localhost-startStop-1] [SpringObjectFactoryProxy] Deferring initialisation of com.atlassian.bamboo.plugin.xwork.SpringObjectFactoryProxy
2022-01-18 04:04:29,665 INFO [localhost-startStop-1] [Dispatchers] Dispatcher org.apache.struts2.dispatcher.Dispatcher initialised
2022-01-18 04:04:29,756 INFO [localhost-startStop-1] [SpringObjectFactoryProxy] Deferring initialisation of com.atlassian.bamboo.plugin.xwork.SpringObjectFactoryProxy
2022-01-18 04:04:30,456 INFO [localhost-startStop-1] [SpringObjectFactoryProxy] Deferring initialisation of com.atlassian.bamboo.plugin.xwork.SpringObjectFactoryProxy
2022-01-18 04:04:30,562 INFO [localhost-startStop-1] [Dispatchers] Dispatcher org.apache.struts2.dispatcher.Dispatcher initialised
2022-01-18 04:04:30,602 INFO [localhost-startStop-1] [SpringObjectFactoryProxy] Deferring initialisation of com.atlassian.bamboo.plugin.xwork.SpringObjectFactoryProxy
2022-01-18 04:04:31,080 INFO [http-nio-127.0.0.1-8085-exec-53] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/plugins/servlet' to '/error/errorEvents.action'
2022-01-18 04:04:31,080 INFO [http-nio-127.0.0.1-8085-exec-15] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/plugins/servlet' to '/error/errorEvents.action'
2022-01-18 04:04:31,080 INFO [http-nio-127.0.0.1-8085-exec-9] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/rest/api/latest/status' to '/error/errorEvents.action'
2022-01-18 04:04:31,080 INFO [http-nio-127.0.0.1-8085-exec-61] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/plugins/servlet' to '/error/errorEvents.action'
2022-01-18 04:04:31,080 INFO [http-nio-127.0.0.1-8085-exec-75] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/plugins/servlet' to '/error/errorEvents.action'
2022-01-18 04:04:31,081 INFO [http-nio-127.0.0.1-8085-exec-56] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/rest/api/latest/status' to '/error/errorEvents.action'
2022-01-18 04:04:31,081 INFO [http-nio-127.0.0.1-8085-exec-95] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/plugins/servlet' to '/error/errorEvents.action'
2022-01-18 04:04:31,080 INFO [http-nio-127.0.0.1-8085-exec-24] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/plugins/servlet' to '/error/errorEvents.action'
2022-01-18 04:04:31,081 INFO [http-nio-127.0.0.1-8085-exec-92] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/rest/api/latest/status' to '/error/errorEvents.action'
2022-01-18 04:04:31,080 INFO [http-nio-127.0.0.1-8085-exec-44] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/rest/api/latest/status' to '/error/errorEvents.action'
2022-01-18 04:04:31,081 INFO [http-nio-127.0.0.1-8085-exec-78] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/plugins/servlet' to '/error/errorEvents.action'
2022-01-18 04:04:31,082 INFO [http-nio-127.0.0.1-8085-exec-41] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/plugins/servlet' to '/error/errorEvents.action'
Workaround
In the Operating system, implement a log watch rule that would match [DefaultBootstrapManager] Failed to acquire cluster lock in <bamboo-home>/logs/atlassian-bamboo.log and trigger a full restart of Bamboo.
Notes
- mentioned in
-
Page Loading...