Loading...

Type: Bug
Resolution: Fixed
Priority: Low
Fix Version/s: 8.2.0, 8.1.4
Affects Version/s: 8.0.0, 8.0.1, 8.0.2, 8.0.3, 8.1.1, 8.0.4, 8.0.5, 8.0.6
Component/s: Bamboo Release
Labels:
None

Support reference count:
1
Symptom Severity:
Severity 2 - Major

Problem

A Bamboo node in HA does not fully exit when a DB connection is lost. Once the standby node loses connection to the database, the Java process doesn't exit fully, leaving the application in an unknown, unrecoverable state, not prone to any operating system recovery as the PID is still up.

If a partial network partition happens, the standby node may be put in an unknown state and may not be able to take over any activities when required.

Environment

Bamboo 8+ (HA)

Steps to Reproduce

Set up 2-node Bamboo HA cluster
Start the cluster normally
Drop the DB connection on the standby node (use a firewall, block on the DB, etc)
Observe the standby node's ClusterLockBootstrapService process trying to reconnect to the DB a couple of times. After that, it dies and leaves the webserver running. The main PID is still valid and the application is still considered as "running" - so no watchdogs will have ways to initiate any recovery action
The non-active (standby) node won't recover if the DB connection is restored

Expected Results

The Bamboo standby node should either:

Try to connect to the DB for as long as possible during startup. If already connected but received a reset, try reconnecting for only a specific amount of time then die

or

Exit the main Java process immediately after a DB connection fails, so any OS service watchdogs could have decided to whether restart the application or report it

or

A combination of the two options above

Actual Results

The ClusterLockBootstrapService process dies after 3 x 30000ms checks and leaves the webserver running without supervision.

2022-01-18 04:04:23,398 FATAL [localhost-startStop-1] [DefaultBootstrapManager] Failed to acquire cluster lock
java.lang.RuntimeException: java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30000ms.
        at com.google.common.base.Throwables.propagate(Throwables.java:241)
        at com.atlassian.bamboo.beehive.AbstractBambooRawJdbcDao.getBootstrapDatabaseConnection(AbstractBambooRawJdbcDao.java:116)
        at com.atlassian.bamboo.beehive.AbstractBambooRawJdbcDao.supplierWithConnection(AbstractBambooRawJdbcDao.java:71)
        at com.atlassian.bamboo.beehive.BambooClusterLockRawJdbcDao.tryUpdateAcquireNodeScopedLock(BambooClusterLockRawJdbcDao.java:103)
        at com.atlassian.bamboo.beehive.ClusterLockBootstrapServiceImpl.acquirePrimaryNodeClusterLock(ClusterLockBootstrapServiceImpl.java:65)
        at com.atlassian.bamboo.setup.DefaultBootstrapManager.performPersistenceUpgrade(DefaultBootstrapManager.java:378)
        at com.atlassian.config.bootstrap.DefaultAtlassianBootstrapManager.init(DefaultAtlassianBootstrapManager.java:81)
        at com.atlassian.bamboo.setup.BootstrapLoaderListener.contextInitialized(BootstrapLoaderListener.java:155)
        at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4763)
        at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5232)
        at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:183)
        at org.apache.catalina.core.ContainerBase$StartChild.call(ContainerBase.java:1427)
        at org.apache.catalina.core.ContainerBase$StartChild.call(ContainerBase.java:1417)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30000ms.
        at com.zaxxer.hikari.pool.HikariPool.createTimeoutException(HikariPool.java:676)
        at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:190)
        at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:155)
        at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:100)
        at org.hibernate.hikaricp.internal.HikariCPConnectionProvider.getConnection(HikariCPConnectionProvider.java:77)
        at com.atlassian.bamboo.upgrade.BootstrapTaskConnectionProvider.getConnection(BootstrapTaskConnectionProvider.java:82)
        at com.atlassian.bamboo.beehive.AbstractBambooRawJdbcDao.getBootstrapDatabaseConnection(AbstractBambooRawJdbcDao.java:113)
        ... 15 more
Caused by: java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection
        at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:774)
        at oracle.jdbc.driver.PhysicalConnection.connect(PhysicalConnection.java:688)
        at oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:39)
        at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:691)
        at com.zaxxer.hikari.util.DriverDataSource.getConnection(DriverDataSource.java:136)
        at com.zaxxer.hikari.pool.PoolBase.newConnection(PoolBase.java:369)
        at com.zaxxer.hikari.pool.PoolBase.newPoolEntry(PoolBase.java:198)
        at com.zaxxer.hikari.pool.HikariPool.createPoolEntry(HikariPool.java:467)
        at com.zaxxer.hikari.pool.HikariPool.access$100(HikariPool.java:71)
        at com.zaxxer.hikari.pool.HikariPool$PoolEntryCreator.call(HikariPool.java:706)
        at com.zaxxer.hikari.pool.HikariPool$PoolEntryCreator.call(HikariPool.java:692)
        ... 4 more
Caused by: oracle.net.ns.NetException: The Network Adapter could not establish the connection
        at oracle.net.nt.ConnStrategy.execute(ConnStrategy.java:523)
        at oracle.net.resolver.AddrResolution.resolveAndExecute(AddrResolution.java:521)
        at oracle.net.ns.NSProtocol.establishConnection(NSProtocol.java:660)
        at oracle.net.ns.NSProtocol.connect(NSProtocol.java:286)
        at oracle.jdbc.driver.T4CConnection.connect(T4CConnection.java:1438)
        at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:518)
        ... 14 more
Caused by: java.io.IOException: Socket read timed out, socket connect lapse 3 ms. db.internal/192.168.3.96 1521 3 3 true
        at oracle.net.nt.TcpNTAdapter.connect(TcpNTAdapter.java:209)
        at oracle.net.nt.ConnOption.connect(ConnOption.java:161)
        at oracle.net.nt.ConnStrategy.execute(ConnStrategy.java:470)
        ... 19 more
Caused by: java.io.InterruptedIOException: Socket read timed out
        at oracle.net.nt.TimeoutSocketChannel.handleInterrupt(TimeoutSocketChannel.java:307)
        at oracle.net.nt.TimeoutSocketChannel.<init>(TimeoutSocketChannel.java:84)
        at oracle.net.nt.TcpNTAdapter.connect(TcpNTAdapter.java:169)
        ... 21 more
2022-01-18 04:04:23,402 INFO [localhost-startStop-1] [lifecycle] Bamboo home directory: /appl/bamboo_data
2022-01-18 04:04:23,402 INFO [localhost-startStop-1] [lifecycle] Default charset: UTF-8, file name ecoding: utf-8
2022-01-18 04:04:23,415 INFO [localhost-startStop-1] [HikariDataSource] HikariPool-1 - Shutdown initiated...
2022-01-18 04:04:27,924 INFO [localhost-startStop-1] [HikariDataSource] HikariPool-1 - Shutdown completed.
2022-01-18 04:04:27,926 INFO [localhost-startStop-1] [UpgradeLauncher] Upgrades not performed since the application has not been set up yet.
2022-01-18 04:04:27,964 INFO [localhost-startStop-1] [CompressingFilter] CompressingFilter has initialized
2022-01-18 04:04:28,672 INFO [localhost-startStop-1] [SpringObjectFactoryProxy] Deferring initialisation of com.atlassian.bamboo.plugin.xwork.SpringObjectFactoryProxy
2022-01-18 04:04:29,665 INFO [localhost-startStop-1] [Dispatchers] Dispatcher org.apache.struts2.dispatcher.Dispatcher initialised
2022-01-18 04:04:29,756 INFO [localhost-startStop-1] [SpringObjectFactoryProxy] Deferring initialisation of com.atlassian.bamboo.plugin.xwork.SpringObjectFactoryProxy
2022-01-18 04:04:30,456 INFO [localhost-startStop-1] [SpringObjectFactoryProxy] Deferring initialisation of com.atlassian.bamboo.plugin.xwork.SpringObjectFactoryProxy
2022-01-18 04:04:30,562 INFO [localhost-startStop-1] [Dispatchers] Dispatcher org.apache.struts2.dispatcher.Dispatcher initialised
2022-01-18 04:04:30,602 INFO [localhost-startStop-1] [SpringObjectFactoryProxy] Deferring initialisation of com.atlassian.bamboo.plugin.xwork.SpringObjectFactoryProxy
2022-01-18 04:04:31,080 INFO [http-nio-127.0.0.1-8085-exec-53] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/plugins/servlet' to '/error/errorEvents.action'
2022-01-18 04:04:31,080 INFO [http-nio-127.0.0.1-8085-exec-15] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/plugins/servlet' to '/error/errorEvents.action'
2022-01-18 04:04:31,080 INFO [http-nio-127.0.0.1-8085-exec-9] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/rest/api/latest/status' to '/error/errorEvents.action'
2022-01-18 04:04:31,080 INFO [http-nio-127.0.0.1-8085-exec-61] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/plugins/servlet' to '/error/errorEvents.action'
2022-01-18 04:04:31,080 INFO [http-nio-127.0.0.1-8085-exec-75] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/plugins/servlet' to '/error/errorEvents.action'
2022-01-18 04:04:31,081 INFO [http-nio-127.0.0.1-8085-exec-56] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/rest/api/latest/status' to '/error/errorEvents.action'
2022-01-18 04:04:31,081 INFO [http-nio-127.0.0.1-8085-exec-95] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/plugins/servlet' to '/error/errorEvents.action'
2022-01-18 04:04:31,080 INFO [http-nio-127.0.0.1-8085-exec-24] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/plugins/servlet' to '/error/errorEvents.action'
2022-01-18 04:04:31,081 INFO [http-nio-127.0.0.1-8085-exec-92] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/rest/api/latest/status' to '/error/errorEvents.action'
2022-01-18 04:04:31,080 INFO [http-nio-127.0.0.1-8085-exec-44] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/rest/api/latest/status' to '/error/errorEvents.action'
2022-01-18 04:04:31,081 INFO [http-nio-127.0.0.1-8085-exec-78] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/plugins/servlet' to '/error/errorEvents.action'
2022-01-18 04:04:31,082 INFO [http-nio-127.0.0.1-8085-exec-41] [JohnsonFilter] The application is still starting up, or there are errors. Redirecting request from '/plugins/servlet' to '/error/errorEvents.action'

Workaround

In the Operating system, implement a log watch rule that would match [DefaultBootstrapManager] Failed to acquire cluster lock in <bamboo-home>/logs/atlassian-bamboo.log and trigger a full restart of Bamboo.

Notes

mentioned in: Page Loading...

Details

Description

Problem

Environment

Steps to Reproduce

Expected Results

Actual Results

Workaround

Notes

Attachments

Issue Links

Activity

People

Dates