Race condition between BambooClusterNodeHeartbeatService and NodeAliveWatchdog


    • Severity 2 - Major

      Problem

      Is this reproducible on Data Center: Yes

      If the NodeAliveWatchdog thread enters a running state waiting for the DB lock first and times out with a failure, and DB lock access is restored before BambooClusterNodeHeartbeatService also times out, Bamboo shuts down ActiveMQ only and keeps its status as "UP/Active", but no Agents or Queues will work.
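      The sketch below is a minimal, self-contained Java model of that timing window. It is illustrative only, not Bamboo code: the 140s/160s figures are assumptions chosen to fall inside the 120s-180s window described in the reproduction steps below, and only the 3-minute heartbeat timeout comes from the report. It just shows why a DB recovery that lands after the watchdog gives up but before the heartbeat times out leaves the node "UP" with its broker gone.

        import java.util.concurrent.TimeUnit;

        // Illustrative model of the race only - not Bamboo internals.
        public class HeartbeatWatchdogRace {

            // Per this report: the watchdog gives up somewhere between 120s and 180s of
            // blocked DB access; the heartbeat service times out after 3 minutes.
            static final long WATCHDOG_GIVES_UP_MS = 140_000;   // assumed point inside the 120s-180s window
            static final long HEARTBEAT_TIMEOUT_MS = 180_000;   // 3 minutes

            public static void main(String[] args) {
                // Assumed timeline: the DB lock is blocked at t=0 and released again at t=160s,
                // i.e. after the watchdog gave up but before the heartbeat timed out.
                long dbRecoveredAtMs = 160_000;

                // Watchdog path: it gave up the ActiveMQ master lock before the DB recovered,
                // so the embedded broker is shut down and never restarted.
                boolean brokerRunning = dbRecoveredAtMs < WATCHDOG_GIVES_UP_MS;

                // Heartbeat path: the DB came back before the 3-minute timeout, so the node
                // is never declared dead and no node shutdown is triggered.
                boolean nodeMarkedUp = dbRecoveredAtMs < HEARTBEAT_TIMEOUT_MS;

                System.out.printf("DB recovered after %ds%n",
                        TimeUnit.MILLISECONDS.toSeconds(dbRecoveredAtMs));
                System.out.println("Broker running: " + brokerRunning); // false
                System.out.println("Node marked UP: " + nodeMarkedUp);  // true -> inconsistent state
            }
        }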

      Environment

      Bamboo Data Center 8, 9

      Steps to Reproduce

      1. Start Bamboo completely (we need clean, fully running threads)
      2. Connect to Bamboo via JMX - I've used VisualVM
      3. Monitor the following threads (a JMX polling sketch is included after these steps):
        • atlassian-scheduler-quartz2.local_WorkerX (X ranges from 1 to 4)
        • nodeHeartbeat.quartz_Worker-1
      4. This part is tricky and bound to randomness; whether it can be reproduced depends on when Bamboo was initialised and in which order the scheduled tasks started first:
        • Simulate a DB, CPU or network issue by locking the CLUSTER_LOCK table on the DB:
          -- make sure to set auto-commit OFF;
          -- lock the table to simulate heavy usage or network issues
          LOCK TABLE CLUSTER_LOCK IN ACCESS EXCLUSIVE MODE;
          SELECT count(LOCK_NAME) FROM CLUSTER_LOCK;
          
        • Monitor the threads in the JMX visualiser; the atlassian-scheduler-quartz2.local_WorkerX thread MUST change to "Running" BEFORE nodeHeartbeat.quartz_Worker-1
        • If atlassian-scheduler-quartz2.local_WorkerX doesn't change to Running before nodeHeartbeat.quartz_Worker-1, run an SQL commit; and start over again from the LOCK TABLE statement
      5. There is a "magic number" derived from the 3-minute timeout of BambooClusterNodeHeartbeatService and the 20-second scheduled task of NodeAliveWatchdog - an offset between 120s and 180s
      6. On the JMX visualiser, observe that both threads will change to "Running"
      7. Once the atlassian-scheduler-quartz2.local_WorkerX (NodeAliveWatchdog) thread reaches the magic number, ActiveMQ will shut down with:
        2022-11-03 15:27:15,733 ERROR [ActiveMQ Lock KeepAlive Timer] [LockableServiceSupport] bamboo, no longer able to keep the exclusive lock so giving up being a master
        2022-11-03 15:27:15,734 INFO [ActiveMQ Lock KeepAlive Timer] [BrokerService] Apache ActiveMQ 5.16.4 (bamboo, ID:d_bamboo825-40037-1667449215817-0:1) is shutting down
        2022-11-03 15:27:15,737 INFO [ActiveMQ Lock KeepAlive Timer] [TransportConnector] Connector nio+ssl://0.0.0.0:46825?wireFormat.maxInactivityDuration=300000&transport.enabledProtocols=TLSv1.2 stopped
        2022-11-03 15:27:15,737 INFO [ActiveMQ Transport Server Thread Handler: nio+ssl://0.0.0.0:46825?wireFormat.maxInactivityDuration=300000&transport.enabledProtocols=TLSv1.2] [TcpTransportServer] socketQueue interrupted - stopping
        2022-11-03 15:27:15,738 INFO [ActiveMQ Transport Server Thread Handler: nio+ssl://0.0.0.0:46825?wireFormat.maxInactivityDuration=300000&transport.enabledProtocols=TLSv1.2] [TransportConnector] Could not accept connection during shutdown  : null (null)
        2022-11-03 15:27:15,738 INFO [ActiveMQ Lock KeepAlive Timer] [TransportConnector] Connector tcp://localhost:46827?wireFormat.maxInactivityDuration=300000&transport.enabledProtocols=TLSv1.2 stopped
        2022-11-03 15:27:15,739 INFO [ActiveMQ Lock KeepAlive Timer] [TransportConnector] Connector ssl://0.0.0.0:46826?wireFormat.maxInactivityDuration=300000&transport.enabledProtocols=TLSv1.2 stopped
        2022-11-03 15:27:15,786 WARN [buildTailMessageListenerConnector-1] [FingerprintMatchingMessageListenerContainer] Setup of JMS message listener invoker failed for destination 'queue://com.atlassian.bamboo.buildTailQueue' - trying to recover. Cause: The Consumer is closed
        2022-11-03 15:27:15,786 WARN [bambooAgentMessageListenerConnector-1] [FingerprintMatchingMessageListenerContainer] Setup of JMS message listener invoker failed for destination 'queue://com.atlassian.bamboo.serverQueue' - trying to recover. Cause: The Session is closed
        2022-11-03 15:27:15,786 WARN [bambooHeartBeatMessageListenerConnector-1] [BambooDefaultMessageListenerContainer] Setup of JMS message listener invoker failed for destination 'queue://com.atlassian.bamboo.heartbeatQueue' - trying to recover. Cause: The Consumer is closed
        2022-11-03 15:27:15,800 INFO [ActiveMQ Lock KeepAlive Timer] [TransportConnector] Connector vm://bamboo stopped
        2022-11-03 15:27:15,816 INFO [ActiveMQ Lock KeepAlive Timer] [BrokerPluginSupport] Broker Plugin org.apache.activemq.broker.util.TimeStampingBrokerPlugin stopped
        2022-11-03 15:27:15,817 INFO [ActiveMQ Lock KeepAlive Timer] [PListStoreImpl] PListStore:[/var/atlassian/application-data/bamboo/shared/jms-store/bamboo/tmp_storage] stopped
        2022-11-03 15:27:15,817 INFO [ActiveMQ Lock KeepAlive Timer] [KahaDBStore] Stopping async queue tasks
        2022-11-03 15:27:15,818 INFO [ActiveMQ Lock KeepAlive Timer] [KahaDBStore] Stopping async topic tasks
        2022-11-03 15:27:15,818 INFO [ActiveMQ Lock KeepAlive Timer] [KahaDBStore] Stopped KahaDB
        2022-11-03 15:27:15,826 INFO [ActiveMQ Lock KeepAlive Timer] [BambooAmqClusterLocker] Bamboo amq cluster locker stopped
        2022-11-03 15:27:15,842 INFO [ActiveMQ Lock KeepAlive Timer] [BrokerService] Apache ActiveMQ 5.16.4 (bamboo, ID:d_bamboo825-40037-1667449215817-0:1) uptime 7 minutes
        2022-11-03 15:27:15,842 INFO [ActiveMQ Lock KeepAlive Timer] [BrokerService] Apache ActiveMQ 5.16.4 (bamboo, ID:d_bamboo825-40037-1667449215817-0:1) is shutdown
        2022-11-03 15:27:15,855 INFO [ActiveMQ Connection Executor: vm://bamboo#0] [BrokerService] Using Persistence Adapter: KahaDBPersistenceAdapter[/activemq-data/bamboo/KahaDB]
        2022-11-03 15:27:15,858 INFO [ActiveMQ Connection Executor: vm://bamboo#0] [SharedFileLocker] Database activemq-data/bamboo/KahaDB/lock is locked by another server. This broker is now in slave mode waiting a lock to be acquired
        
      8. Immediately release the lock on the DB by running an SQL COMMIT; statement. This needs to be done before the BambooClusterNodeHeartbeatService thread reaches the magic number, simulating a system recovery
      9. Bamboo will remain running, but ActiveMQ will stay shut down and will not recover
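      If watching VisualVM by eye proves too error-prone, the two thread pools from step 3 can also be polled over remote JMX with the standard java.lang.management API. The sketch below is a minimal example: the JMX service URL is an assumption and must match whatever endpoint your Bamboo JVM exposes, and "Running" in VisualVM corresponds to the RUNNABLE state printed here.

        import java.lang.management.ManagementFactory;
        import java.lang.management.ThreadInfo;
        import java.lang.management.ThreadMXBean;
        import javax.management.MBeanServerConnection;
        import javax.management.remote.JMXConnector;
        import javax.management.remote.JMXConnectorFactory;
        import javax.management.remote.JMXServiceURL;

        // Polls the two relevant thread pools and prints their states every 5 seconds,
        // making the "WorkerX goes Running before nodeHeartbeat" ordering easier to catch.
        public class WatchBambooThreads {
            public static void main(String[] args) throws Exception {
                // Assumed JMX endpoint - adjust host/port to your Bamboo JVM's JMX settings.
                String url = "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi";
                JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(url));
                try {
                    MBeanServerConnection conn = connector.getMBeanServerConnection();
                    ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                            conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);

                    while (true) {
                        for (ThreadInfo info : threads.getThreadInfo(threads.getAllThreadIds())) {
                            if (info == null) {
                                continue;
                            }
                            String name = info.getThreadName();
                            if (name.startsWith("atlassian-scheduler-quartz2.local_Worker")
                                    || name.startsWith("nodeHeartbeat.quartz_Worker")) {
                                System.out.printf("%-55s %s%n", name, info.getThreadState());
                            }
                        }
                        System.out.println("----");
                        Thread.sleep(5_000);
                    }
                } finally {
                    connector.close();
                }
            }
        }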

      Expected Results

      1. If the DB lock is released in time, Bamboo should restart ActiveMQ if it had been shut down,
        or
      2. Bamboo should perform a complete shutdown if ActiveMQ has exited (a sketch of this kind of consistency check follows this list)
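      Below is a minimal sketch of the kind of consistency check these expected results describe. Every supplier and runnable is a placeholder, not a Bamboo API; the point is only the decision logic: if the node still reports itself alive but the broker is gone, either bring the broker back or take the whole node down.

        import java.util.concurrent.Executors;
        import java.util.concurrent.ScheduledExecutorService;
        import java.util.concurrent.TimeUnit;
        import java.util.function.BooleanSupplier;

        // Hypothetical broker/node consistency check - not Bamboo code.
        public class BrokerNodeConsistencyCheck {
            private final ScheduledExecutorService scheduler =
                    Executors.newSingleThreadScheduledExecutor();

            public void start(BooleanSupplier nodeIsAlive,
                              BooleanSupplier brokerIsRunning,
                              Runnable restartBroker,
                              Runnable shutdownNode) {
                scheduler.scheduleAtFixedRate(() -> {
                    if (nodeIsAlive.getAsBoolean() && !brokerIsRunning.getAsBoolean()) {
                        try {
                            restartBroker.run();   // expected result 1: bring ActiveMQ back
                        } catch (RuntimeException e) {
                            shutdownNode.run();    // expected result 2: fail the whole node
                        }
                    }
                }, 20, 20, TimeUnit.SECONDS);      // 20s cadence chosen to match the watchdog schedule
            }
        }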

      Actual Results

      Bamboo shuts down ActiveMQ and leaves the application in an inconsistent state, with no Agents connected and no queueing operational

      Workaround

      Bamboo 8.1.1 and later: If Bamboo is not running on a cluster, you can configure the "-Dbamboo.node.alive.watchdog.timeout=0" system property to disable the node watchdog.
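      For clarity, the sketch below illustrates the convention this flag implies: a value of 0 is treated as "watchdog disabled" rather than a zero-second timeout. Only the property name comes from the workaround above; the class, variable and default value are placeholders, not Bamboo internals.

        // Hypothetical illustration only - not Bamboo code.
        public class WatchdogTimeoutFlag {
            public static void main(String[] args) {
                // Read the system property; -1 here is just a placeholder default.
                long timeoutValue = Long.getLong("bamboo.node.alive.watchdog.timeout", -1L);
                boolean watchdogDisabled = timeoutValue == 0;   // 0 disables the watchdog, per the workaround
                System.out.println("Node alive watchdog disabled: " + watchdogDisabled);
            }
        }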

      Notes

      If the DB lock is not released in time with the commit; and the BambooClusterNodeHeartbeatService thread reaches the magic number, Bamboo will then shut down completely and behave as expected.

            Assignee: Krzysztof Podsiadło
            Reporter: Eduardo Alvarenga (Inactive)