- Type: Bug
- Resolution: Fixed
- Priority: High
- Affects Version/s: 8.2.6, 9.0.1
- Component/s: Data center, Infrastructure
- Severity: 2 - Major
Problem
Is this reproducible on Data Center: Yes
If the NodeAliveWatchdog thread starts first, enters a running state waiting for a DB lock, and times out with a failure, and DB lock access is then restored before BambooClusterNodeHeartbeatService times out, Bamboo shuts down only ActiveMQ while keeping its node status as "UP/Active", but no Agents or Queues will work.
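For illustration only, the sketch below models the timing window described above. It is not Bamboo source code: the class name, the 150s "give up" value and the outcome strings are assumptions based on this report (ActiveMQ gives up its lock somewhere between 120s and 180s after the DB lock is taken, while the node heartbeat only times out after 3 minutes).

```java
// Illustrative sketch only - NOT Bamboo source code. Models the window described in this report.
import java.time.Duration;

public class OutageWindowSketch {

    // Hypothetical "magic number" inside the 120s-180s range mentioned in the steps below.
    static final Duration AMQ_GIVE_UP = Duration.ofSeconds(150);
    // 3-minute timeout of BambooClusterNodeHeartbeatService, as stated in this report.
    static final Duration HEARTBEAT_TIMEOUT = Duration.ofMinutes(3);

    static String outcome(Duration dbLockHeldFor) {
        boolean brokerShutDown = dbLockHeldFor.compareTo(AMQ_GIVE_UP) >= 0;
        boolean nodeShutDown = dbLockHeldFor.compareTo(HEARTBEAT_TIMEOUT) >= 0;
        if (brokerShutDown && !nodeShutDown) {
            return "ActiveMQ shut down, node still reported UP/Active -> inconsistent state (this bug)";
        }
        if (nodeShutDown) {
            return "whole node shuts down (expected behaviour)";
        }
        return "lock released in time, no impact";
    }

    public static void main(String[] args) {
        for (int seconds : new int[] {60, 160, 200}) {
            System.out.printf("DB lock held for %ds: %s%n", seconds, outcome(Duration.ofSeconds(seconds)));
        }
    }
}
```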
Environment
Bamboo Data Center 8, 9
Steps to Reproduce
- Start Bamboo and let it initialise completely (we need the scheduler threads running from a clean start)
- Connect to Bamboo via JMX - I've used VisualVM (a programmatic JMX polling sketch is also included after this list)
- Monitor the following threads:
- atlassian-scheduler-quartz2.local_WorkerX (X from 1 to 4)
- nodeHeartbeat.quartz_Worker-1
- This part is tricky and bound to randomness; whether it can be reproduced depends on the timing conditions when Bamboo was initialised and on which scheduler tasks started first:
- Simulate a DB, CPU or network issue by taking a lock on the CLUSTER_LOCK table in the DB:
```sql
-- make sure to set auto-commit OFF
-- lock the table to simulate heavy usage or network issues
LOCK TABLE CLUSTER_LOCK IN ACCESS EXCLUSIVE MODE;
SELECT count(LOCK_NAME) FROM CLUSTER_LOCK;
```
- Monitor the threads in the JMX visualiser: the atlassian-scheduler-quartz2.local_WorkerX thread MUST change to "Running" BEFORE nodeHeartbeat.quartz_Worker-1
- If atlassian-scheduler-quartz2.local_WorkerX doesn't change to "Running" before nodeHeartbeat.quartz_Worker-1, run an SQL commit; and start over from the LOCK TABLE step
- There is a "magic number" resulting from the combination of the 3-minute timeout of BambooClusterNodeHeartbeatService and the 20-second scheduled task of NodeAliveWatchdog: an offset somewhere between 120s and 180s
- On the JMX visualiser, observe that both threads will change to "Running"

- Once the atlassian-scheduler-quartz2.local_WorkerX/NodeAliveWatchdog thread reaches the magic number, ActiveMQ will crash with:
```
2022-11-03 15:27:15,733 ERROR [ActiveMQ Lock KeepAlive Timer] [LockableServiceSupport] bamboo, no longer able to keep the exclusive lock so giving up being a master
2022-11-03 15:27:15,734 INFO [ActiveMQ Lock KeepAlive Timer] [BrokerService] Apache ActiveMQ 5.16.4 (bamboo, ID:d_bamboo825-40037-1667449215817-0:1) is shutting down
2022-11-03 15:27:15,737 INFO [ActiveMQ Lock KeepAlive Timer] [TransportConnector] Connector nio+ssl://0.0.0.0:46825?wireFormat.maxInactivityDuration=300000&transport.enabledProtocols=TLSv1.2 stopped
2022-11-03 15:27:15,737 INFO [ActiveMQ Transport Server Thread Handler: nio+ssl://0.0.0.0:46825?wireFormat.maxInactivityDuration=300000&transport.enabledProtocols=TLSv1.2] [TcpTransportServer] socketQueue interrupted - stopping
2022-11-03 15:27:15,738 INFO [ActiveMQ Transport Server Thread Handler: nio+ssl://0.0.0.0:46825?wireFormat.maxInactivityDuration=300000&transport.enabledProtocols=TLSv1.2] [TransportConnector] Could not accept connection during shutdown : null (null)
2022-11-03 15:27:15,738 INFO [ActiveMQ Lock KeepAlive Timer] [TransportConnector] Connector tcp://localhost:46827?wireFormat.maxInactivityDuration=300000&transport.enabledProtocols=TLSv1.2 stopped
2022-11-03 15:27:15,739 INFO [ActiveMQ Lock KeepAlive Timer] [TransportConnector] Connector ssl://0.0.0.0:46826?wireFormat.maxInactivityDuration=300000&transport.enabledProtocols=TLSv1.2 stopped
2022-11-03 15:27:15,786 WARN [buildTailMessageListenerConnector-1] [FingerprintMatchingMessageListenerContainer] Setup of JMS message listener invoker failed for destination 'queue://com.atlassian.bamboo.buildTailQueue' - trying to recover. Cause: The Consumer is closed
2022-11-03 15:27:15,786 WARN [bambooAgentMessageListenerConnector-1] [FingerprintMatchingMessageListenerContainer] Setup of JMS message listener invoker failed for destination 'queue://com.atlassian.bamboo.serverQueue' - trying to recover. Cause: The Session is closed
2022-11-03 15:27:15,786 WARN [bambooHeartBeatMessageListenerConnector-1] [BambooDefaultMessageListenerContainer] Setup of JMS message listener invoker failed for destination 'queue://com.atlassian.bamboo.heartbeatQueue' - trying to recover. Cause: The Consumer is closed
2022-11-03 15:27:15,800 INFO [ActiveMQ Lock KeepAlive Timer] [TransportConnector] Connector vm://bamboo stopped
2022-11-03 15:27:15,816 INFO [ActiveMQ Lock KeepAlive Timer] [BrokerPluginSupport] Broker Plugin org.apache.activemq.broker.util.TimeStampingBrokerPlugin stopped
2022-11-03 15:27:15,817 INFO [ActiveMQ Lock KeepAlive Timer] [PListStoreImpl] PListStore:[/var/atlassian/application-data/bamboo/shared/jms-store/bamboo/tmp_storage] stopped
2022-11-03 15:27:15,817 INFO [ActiveMQ Lock KeepAlive Timer] [KahaDBStore] Stopping async queue tasks
2022-11-03 15:27:15,818 INFO [ActiveMQ Lock KeepAlive Timer] [KahaDBStore] Stopping async topic tasks
2022-11-03 15:27:15,818 INFO [ActiveMQ Lock KeepAlive Timer] [KahaDBStore] Stopped KahaDB
2022-11-03 15:27:15,826 INFO [ActiveMQ Lock KeepAlive Timer] [BambooAmqClusterLocker] Bamboo amq cluster locker stopped
2022-11-03 15:27:15,842 INFO [ActiveMQ Lock KeepAlive Timer] [BrokerService] Apache ActiveMQ 5.16.4 (bamboo, ID:d_bamboo825-40037-1667449215817-0:1) uptime 7 minutes
2022-11-03 15:27:15,842 INFO [ActiveMQ Lock KeepAlive Timer] [BrokerService] Apache ActiveMQ 5.16.4 (bamboo, ID:d_bamboo825-40037-1667449215817-0:1) is shutdown
2022-11-03 15:27:15,855 INFO [ActiveMQ Connection Executor: vm://bamboo#0] [BrokerService] Using Persistence Adapter: KahaDBPersistenceAdapter[/activemq-data/bamboo/KahaDB]
2022-11-03 15:27:15,858 INFO [ActiveMQ Connection Executor: vm://bamboo#0] [SharedFileLocker] Database activemq-data/bamboo/KahaDB/lock is locked by another server. This broker is now in slave mode waiting a lock to be acquired
```
- Immediately release the lock on the DB by running an SQL commit; - this must be done before the BambooClusterNodeHeartbeatService thread reaches the magic number, simulating a system recovery
- Bamboo will remain running while ActiveMQ stays shut down and does not recover
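If you prefer to poll these thread states programmatically rather than through VisualVM, the sketch below is one way to do it. It is not part of Bamboo; it assumes a remote JMX endpoint is already exposed without authentication on the placeholder URL shown, and only mirrors what the "Monitor the following threads" step does in the UI.

```java
// Minimal JMX polling sketch (assumption: Bamboo exposes remote JMX on the placeholder URL below).
// Prints the state of the two thread families named in the reproduction steps every 5 seconds.
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ThreadStatePoller {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint - adjust host/port to your own JMX configuration.
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection connection = connector.getMBeanServerConnection();
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    connection, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
            while (true) {
                for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
                    String name = info.getThreadName();
                    if (name.startsWith("atlassian-scheduler-quartz2.local_Worker")
                            || name.startsWith("nodeHeartbeat.quartz_Worker")) {
                        System.out.printf("%s -> %s%n", name, info.getThreadState());
                    }
                }
                System.out.println("----");
                Thread.sleep(5_000);
            }
        }
    }
}
```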
Expected Results
- If the DB lock is released in time, Bamboo should restart ActiveMQ if it had been shut down, or
- Bamboo should perform a complete shutdown if ActiveMQ has exited
Actual Results
Bamboo shuts down ActiveMQ and leaves the application in an inconsistent state, with no Agents connected and no queueing operational.
Workaround
Bamboo 8.1.1 and later: If Bamboo is not running on a cluster, you can configure the "-Dbamboo.node.alive.watchdog.timeout=0" system property to disable the node watchdog.
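As a sketch of where that property is typically added (assumption: a standard setenv.sh-based startup where JVM arguments go into JVM_SUPPORT_RECOMMENDED_ARGS; the exact file and variable may differ between Bamboo versions, so verify against your installation before relying on it):

```sh
# Assumed location: <bamboo-install>/bin/setenv.sh - restart Bamboo after the change.
JVM_SUPPORT_RECOMMENDED_ARGS="${JVM_SUPPORT_RECOMMENDED_ARGS} -Dbamboo.node.alive.watchdog.timeout=0"
```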
Notes
If the DB lock is not released in time with the commit; and the BambooClusterNodeHeartbeatService thread reaches the magic number, Bamboo will then shut down and behave as expected.