Race condition between BambooClusterNodeHeartbeatService and NodeAliveWatchdog


    • Severity 2 - Major

      Problem

      Is this reproducible on Data Center: Yes

      If the NodeAliveWatchdog thread enters a running state waiting for the DB lock first and times out with a failure, and DB lock access is restored before BambooClusterNodeHeartbeatService also times out, Bamboo shuts down ActiveMQ only and keeps its status as "UP/Active", but no Agents or Queues will work.
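      The sketch below is a minimal, self-contained Java model of that timing window. It is illustrative only, not Bamboo code: the 140s/160s figures are assumptions chosen to fall inside the 120s-180s window described in the reproduction steps below, and only the 3-minute heartbeat timeout comes from the report. It just shows why a DB recovery that lands after the watchdog gives up but before the heartbeat times out leaves the node "UP" with its broker gone.

        import java.util.concurrent.TimeUnit;

        // Illustrative model of the race only - not Bamboo internals.
        public class HeartbeatWatchdogRace {

            // Per this report: the watchdog gives up somewhere between 120s and 180s of
            // blocked DB access; the heartbeat service times out after 3 minutes.
            static final long WATCHDOG_GIVES_UP_MS = 140_000;   // assumed point inside the 120s-180s window
            static final long HEARTBEAT_TIMEOUT_MS = 180_000;   // 3 minutes

            public static void main(String[] args) {
                // Assumed timeline: the DB lock is blocked at t=0 and released again at t=160s,
                // i.e. after the watchdog gave up but before the heartbeat timed out.
                long dbRecoveredAtMs = 160_000;

                // Watchdog path: it gave up the ActiveMQ master lock before the DB recovered,
                // so the embedded broker is shut down and never restarted.
                boolean brokerRunning = dbRecoveredAtMs < WATCHDOG_GIVES_UP_MS;

                // Heartbeat path: the DB came back before the 3-minute timeout, so the node
                // is never declared dead and no node shutdown is triggered.
                boolean nodeMarkedUp = dbRecoveredAtMs < HEARTBEAT_TIMEOUT_MS;

                System.out.printf("DB recovered after %ds%n",
                        TimeUnit.MILLISECONDS.toSeconds(dbRecoveredAtMs));
                System.out.println("Broker running: " + brokerRunning); // false
                System.out.println("Node marked UP: " + nodeMarkedUp);  // true -> inconsistent state
            }
        }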

      Environment

      Bamboo Data Center 8, 9

      Steps to Reproduce

      1. Start Bamboo completely (we need clean, fully running threads)
      2. Connect to Bamboo via JMX - I've used VisualVM
      3. Monitor the following threads (a JMX polling sketch is included after these steps):
        • atlassian-scheduler-quartz2.local_WorkerX (X ranges from 1 to 4)
        • nodeHeartbeat.quartz_Worker-1
      4. This part is tricky and bound to randomness; whether it can be reproduced depends on when Bamboo was initialised and in which order the scheduled tasks started first:
        • Simulate a DB, CPU or network issue by locking the CLUSTER_LOCK table on the DB:
          -- make sure to set auto-commit OFF;
          -- lock the table to simulate heavy usage or network issues
          LOCK TABLE CLUSTER_LOCK IN ACCESS EXCLUSIVE MODE;
          SELECT count(LOCK_NAME) FROM CLUSTER_LOCK;
          
        • Monitor the threads in the JMX visualiser; the atlassian-scheduler-quartz2.local_WorkerX thread MUST change to "Running" BEFORE nodeHeartbeat.quartz_Worker-1
        • If atlassian-scheduler-quartz2.local_WorkerX doesn't change to Running before nodeHeartbeat.quartz_Worker-1, run an SQL commit; and start over again from the LOCK TABLE statement
      5. There is a "magic number" derived from the 3-minute timeout of BambooClusterNodeHeartbeatService and the 20-second scheduled task of NodeAliveWatchdog - an offset between 120s and 180s
      6. On the JMX visualiser, observe that both threads will change to "Running"
      7. Once the atlassian-scheduler-quartz2.local_WorkerX (NodeAliveWatchdog) thread reaches the magic number, ActiveMQ will shut down with:
        2022-11-03 15:27:15,733 ERROR [ActiveMQ Lock KeepAlive Timer] [LockableServiceSupport] bamboo, no longer able to keep the exclusive lock so giving up being a master
        2022-11-03 15:27:15,734 INFO [ActiveMQ Lock KeepAlive Timer] [BrokerService] Apache ActiveMQ 5.16.4 (bamboo, ID:d_bamboo825-40037-1667449215817-0:1) is shutting down
        2022-11-03 15:27:15,737 INFO [ActiveMQ Lock KeepAlive Timer] [TransportConnector] Connector nio+ssl://0.0.0.0:46825?wireFormat.maxInactivityDuration=300000&transport.enabledProtocols=TLSv1.2 stopped
        2022-11-03 15:27:15,737 INFO [ActiveMQ Transport Server Thread Handler: nio+ssl://0.0.0.0:46825?wireFormat.maxInactivityDuration=300000&transport.enabledProtocols=TLSv1.2] [TcpTransportServer] socketQueue interrupted - stopping
        2022-11-03 15:27:15,738 INFO [ActiveMQ Transport Server Thread Handler: nio+ssl://0.0.0.0:46825?wireFormat.maxInactivityDuration=300000&transport.enabledProtocols=TLSv1.2] [TransportConnector] Could not accept connection during shutdown  : null (null)
        2022-11-03 15:27:15,738 INFO [ActiveMQ Lock KeepAlive Timer] [TransportConnector] Connector tcp://localhost:46827?wireFormat.maxInactivityDuration=300000&transport.enabledProtocols=TLSv1.2 stopped
        2022-11-03 15:27:15,739 INFO [ActiveMQ Lock KeepAlive Timer] [TransportConnector] Connector ssl://0.0.0.0:46826?wireFormat.maxInactivityDuration=300000&transport.enabledProtocols=TLSv1.2 stopped
        2022-11-03 15:27:15,786 WARN [buildTailMessageListenerConnector-1] [FingerprintMatchingMessageListenerContainer] Setup of JMS message listener invoker failed for destination 'queue://com.atlassian.bamboo.buildTailQueue' - trying to recover. Cause: The Consumer is closed
        2022-11-03 15:27:15,786 WARN [bambooAgentMessageListenerConnector-1] [FingerprintMatchingMessageListenerContainer] Setup of JMS message listener invoker failed for destination 'queue://com.atlassian.bamboo.serverQueue' - trying to recover. Cause: The Session is closed
        2022-11-03 15:27:15,786 WARN [bambooHeartBeatMessageListenerConnector-1] [BambooDefaultMessageListenerContainer] Setup of JMS message listener invoker failed for destination 'queue://com.atlassian.bamboo.heartbeatQueue' - trying to recover. Cause: The Consumer is closed
        2022-11-03 15:27:15,800 INFO [ActiveMQ Lock KeepAlive Timer] [TransportConnector] Connector vm://bamboo stopped
        2022-11-03 15:27:15,816 INFO [ActiveMQ Lock KeepAlive Timer] [BrokerPluginSupport] Broker Plugin org.apache.activemq.broker.util.TimeStampingBrokerPlugin stopped
        2022-11-03 15:27:15,817 INFO [ActiveMQ Lock KeepAlive Timer] [PListStoreImpl] PListStore:[/var/atlassian/application-data/bamboo/shared/jms-store/bamboo/tmp_storage] stopped
        2022-11-03 15:27:15,817 INFO [ActiveMQ Lock KeepAlive Timer] [KahaDBStore] Stopping async queue tasks
        2022-11-03 15:27:15,818 INFO [ActiveMQ Lock KeepAlive Timer] [KahaDBStore] Stopping async topic tasks
        2022-11-03 15:27:15,818 INFO [ActiveMQ Lock KeepAlive Timer] [KahaDBStore] Stopped KahaDB
        2022-11-03 15:27:15,826 INFO [ActiveMQ Lock KeepAlive Timer] [BambooAmqClusterLocker] Bamboo amq cluster locker stopped
        2022-11-03 15:27:15,842 INFO [ActiveMQ Lock KeepAlive Timer] [BrokerService] Apache ActiveMQ 5.16.4 (bamboo, ID:d_bamboo825-40037-1667449215817-0:1) uptime 7 minutes
        2022-11-03 15:27:15,842 INFO [ActiveMQ Lock KeepAlive Timer] [BrokerService] Apache ActiveMQ 5.16.4 (bamboo, ID:d_bamboo825-40037-1667449215817-0:1) is shutdown
        2022-11-03 15:27:15,855 INFO [ActiveMQ Connection Executor: vm://bamboo#0] [BrokerService] Using Persistence Adapter: KahaDBPersistenceAdapter[/activemq-data/bamboo/KahaDB]
        2022-11-03 15:27:15,858 INFO [ActiveMQ Connection Executor: vm://bamboo#0] [SharedFileLocker] Database activemq-data/bamboo/KahaDB/lock is locked by another server. This broker is now in slave mode waiting a lock to be acquired
        
      8. Immediately release the lock on the DB by running an SQL COMMIT; statement. This needs to be done before the BambooClusterNodeHeartbeatService thread reaches the magic number, simulating a system recovery
      9. Bamboo will remain running, but ActiveMQ will stay shut down and will not recover
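      If watching VisualVM by eye proves too error-prone, the two thread pools from step 3 can also be polled over remote JMX with the standard java.lang.management API. The sketch below is a minimal example: the JMX service URL is an assumption and must match whatever endpoint your Bamboo JVM exposes, and "Running" in VisualVM corresponds to the RUNNABLE state printed here.

        import java.lang.management.ManagementFactory;
        import java.lang.management.ThreadInfo;
        import java.lang.management.ThreadMXBean;
        import javax.management.MBeanServerConnection;
        import javax.management.remote.JMXConnector;
        import javax.management.remote.JMXConnectorFactory;
        import javax.management.remote.JMXServiceURL;

        // Polls the two relevant thread pools and prints their states every 5 seconds,
        // making the "WorkerX goes Running before nodeHeartbeat" ordering easier to catch.
        public class WatchBambooThreads {
            public static void main(String[] args) throws Exception {
                // Assumed JMX endpoint - adjust host/port to your Bamboo JVM's JMX settings.
                String url = "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi";
                JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(url));
                try {
                    MBeanServerConnection conn = connector.getMBeanServerConnection();
                    ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                            conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);

                    while (true) {
                        for (ThreadInfo info : threads.getThreadInfo(threads.getAllThreadIds())) {
                            if (info == null) {
                                continue;
                            }
                            String name = info.getThreadName();
                            if (name.startsWith("atlassian-scheduler-quartz2.local_Worker")
                                    || name.startsWith("nodeHeartbeat.quartz_Worker")) {
                                System.out.printf("%-55s %s%n", name, info.getThreadState());
                            }
                        }
                        System.out.println("----");
                        Thread.sleep(5_000);
                    }
                } finally {
                    connector.close();
                }
            }
        }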

      Expected Results

      1. If the DB lock is released in time, Bamboo should restart ActiveMQ if it had been shut down,
        or
      2. Bamboo should perform a complete shutdown if ActiveMQ has exited (a sketch of this kind of consistency check follows this list)
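      Below is a minimal sketch of the kind of consistency check these expected results describe. Every supplier and runnable is a placeholder, not a Bamboo API; the point is only the decision logic: if the node still reports itself alive but the broker is gone, either bring the broker back or take the whole node down.

        import java.util.concurrent.Executors;
        import java.util.concurrent.ScheduledExecutorService;
        import java.util.concurrent.TimeUnit;
        import java.util.function.BooleanSupplier;

        // Hypothetical broker/node consistency check - not Bamboo code.
        public class BrokerNodeConsistencyCheck {
            private final ScheduledExecutorService scheduler =
                    Executors.newSingleThreadScheduledExecutor();

            public void start(BooleanSupplier nodeIsAlive,
                              BooleanSupplier brokerIsRunning,
                              Runnable restartBroker,
                              Runnable shutdownNode) {
                scheduler.scheduleAtFixedRate(() -> {
                    if (nodeIsAlive.getAsBoolean() && !brokerIsRunning.getAsBoolean()) {
                        try {
                            restartBroker.run();   // expected result 1: bring ActiveMQ back
                        } catch (RuntimeException e) {
                            shutdownNode.run();    // expected result 2: fail the whole node
                        }
                    }
                }, 20, 20, TimeUnit.SECONDS);      // 20s cadence chosen to match the watchdog schedule
            }
        }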

      Actual Results

      Bamboo shuts down ActiveMQ and leaves the application in an inconsistent state, with no Agents connected and no queueing operational

      Workaround

      Bamboo 8.1.1 and later: If Bamboo is not running on a cluster, you can configure the "-Dbamboo.node.alive.watchdog.timeout=0" system property to disable the node watchdog.
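      For clarity, the sketch below illustrates the convention this flag implies: a value of 0 is treated as "watchdog disabled" rather than a zero-second timeout. Only the property name comes from the workaround above; the class, variable and default value are placeholders, not Bamboo internals.

        // Hypothetical illustration only - not Bamboo code.
        public class WatchdogTimeoutFlag {
            public static void main(String[] args) {
                // Read the system property; -1 here is just a placeholder default.
                long timeoutValue = Long.getLong("bamboo.node.alive.watchdog.timeout", -1L);
                boolean watchdogDisabled = timeoutValue == 0;   // 0 disables the watchdog, per the workaround
                System.out.println("Node alive watchdog disabled: " + watchdogDisabled);
            }
        }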

      Notes

      If the DB lock is not released in time with the commit; and the BambooClusterNodeHeartbeatService thread reaches the magic number, Bamboo will then shut down completely and behave as expected.

            Assignee: Krzysztof Podsiadło
            Reporter: Eduardo Alvarenga (Inactive)