- Bug
- Resolution: Fixed
- High
- 9.2.4, 9.3.3
- None
- 1
- Severity 2 - Major
Problem
On a very large Bamboo environment with thousands of Agents, whenever a Plan, Agent, or Capability set is added or modified, an AllAgentsUpdatedEvent recalculation kicks in. If the recalculation takes too long to finish, it may block the queue and cause queued builds to be purged.
Environment
- Bamboo 9.2.x LTS and 9.3.x (may manifest on later and earlier releases)
Steps to Reproduce
- Create a very large Bamboo instance with thousands of Agents
- Modify the Agents' capabilities while the queue is very busy
- Observe that during the BuildQueueManagerImpl recalculation no further Queue activity is processed
- Apply the Agent/Capability/Plan changes serially so that the AgentAssignmentsUpdatedEvent is triggered multiple times
Expected Results
Queued objects should not get locked during Agent calculations. If something has already been dispatched, it should continue its flow and get picked up by an available Agent. The OrphanedBuildMonitorJob should ignore the time the BuildQueueManagerImpl spent recalculating the queue before considering a Build orphaned or hanging.
Actual Results
The OrphanedBuildMonitorJob counts the lockup time during recalculations as part of the total queue time for a build. If that time exceeds the default 720-second limit, legitimate builds may be killed before being dispatched to an Agent.
Workaround
- Reduce the number of Agents if possible (data hygiene for inactive Agents). You can run some SQL statements to find unused Agents:
- From Bamboo 9.3.4 and 9.4.0, you can increase the value of ORPHANED_BUILD_MONITOR_JOB_SCHEDULER_REACTION_DELAY_MULTIPLIER. To do so, add the -Dbamboo.orphaned.build.monitor.reaction.delay.multiplier system property to Bamboo with a value between 3 and 20.
The multiplier is applied in the following formula:
heartbeatTimeoutSeconds (600s) + (orphaned.build.monitor.reaction.delay.multiplier (2) * heartbeatTimeout (60s)) = 720 seconds
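To see how the multiplier changes the limit, the formula can be worked through as a rough sketch. The function and constant names below are hypothetical and are not Bamboo identifiers. The 600s and 60s values are the ones quoted in the formula, and the sketch assumes they stay fixed while only the multiplier changes.

```python
# Values quoted in the ticket's formula (assumed constant for this sketch).
HEARTBEAT_TIMEOUT_SECONDS = 600   # "heartbeatTimeoutSeconds (600s)"
REACTION_DELAY_UNIT_SECONDS = 60  # the 60s term the multiplier is applied to
DEFAULT_MULTIPLIER = 2            # default multiplier shown in the formula


def orphan_threshold_seconds(multiplier: int = DEFAULT_MULTIPLIER) -> int:
    """Seconds a build may stay queued before it is treated as orphaned,
    according to the formula in this ticket."""
    return HEARTBEAT_TIMEOUT_SECONDS + multiplier * REACTION_DELAY_UNIT_SECONDS


if __name__ == "__main__":
    # Default, then the lower bound, a middle value, and the upper bound
    # of the 3 to 20 range suggested in the workaround.
    for m in (2, 3, 10, 20):
        print(f"multiplier={m:>2} -> {orphan_threshold_seconds(m)} seconds")
```

Under these assumptions, the default multiplier of 2 gives the 720 seconds mentioned above, while the maximum suggested value of 20 gives 1800 seconds (30 minutes) of headroom for a recalculation to finish.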