Bamboo Data Center / BAM-25145

On Bamboo instances with a large number of Agents, AllAgentsUpdatedEvent may cause "720 seconds" errors

    • Type: Bug
    • Resolution: Fixed
    • Priority: High
    • Fix Version/s: 9.4.0, 9.3.4
    • Affects Version/s: 9.2.4, 9.3.3
    • Component/s: Build Queues
    • Labels: None

      Problem

      On a very large Bamboo environment with thousands of Agents, whenever a Plan, Agent, or Capability set is added or modified, an AllAgentsUpdatedEvent recalculation kicks in. If the recalculation takes too long to finish, it may block the queue and purge queued builds.

      Environment

      • Bamboo 9.2.x LTS and 9.3.x (may manifest on later and earlier releases) 

      Steps to Reproduce

      1. Create a very large Bamboo instance with thousands of Agents
      2. Modify the Agents' capabilities while the queue is very busy
      3. Observe that no further Queue activity is processed during the BuildQueueManagerImpl recalculation
      4. Apply the Agent/Capability/Plan changes serially to trigger multiple AgentAssignmentsUpdatedEvent calls

      Expected Results

      Queued objects should not get locked during Agent calculations. If something has already been dispatched, it should continue its flow and get picked up by an available Agent. The OrphanedBuildMonitorJob should ignore the time the BuildQueueManagerImpl spent recalculating the queue before considering a Build orphaned or hanging.

      Actual Results

      The OrphanedBuildMonitorJob counts the lockup time spent running recalculations as part of the total queue time for a build. If that time exceeds the default 720-second limit, legitimate builds may end up killed before being dispatched to an Agent.
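
The gap between the expected and actual results can be shown with a small sketch. It is illustrative only: the function and parameter names are hypothetical and do not mirror Bamboo's internal code, and the only value taken from this issue is the 720-second default limit.

```python
# Illustrative sketch only: these names are hypothetical, not Bamboo's API.
ORPHAN_THRESHOLD_SECONDS = 720  # default limit quoted in this issue


def is_considered_orphaned(queued_seconds: int,
                           recalculation_seconds: int,
                           ignore_recalculation: bool) -> bool:
    """Return True if a build would be treated as orphaned.

    queued_seconds: total time the build has spent in the queue,
        including any time the queue was blocked by a recalculation.
    recalculation_seconds: the part of that time spent blocked.
    ignore_recalculation: True models the expected behaviour,
        False models the behaviour reported in this issue.
    """
    if ignore_recalculation:
        effective_seconds = queued_seconds - recalculation_seconds
    else:
        effective_seconds = queued_seconds
    return effective_seconds > ORPHAN_THRESHOLD_SECONDS


# A build queued for 800 seconds, 500 of which were spent blocked.
actual = is_considered_orphaned(800, 500, ignore_recalculation=False)
expected = is_considered_orphaned(800, 500, ignore_recalculation=True)
print(actual, expected)
```

In this example the build is flagged when the blocked time is counted, but not when it is excluded.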

      Workaround

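      1. Reduce the number of Agents if possible (data hygiene for inactive Agents). You can run SQL statements to find unused Agents; see "Generating Bamboo Agents counts from the database": https://confluence.atlassian.com/bamkb/generating-bamboo-agents-counts-from-the-database-1142234753.html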
      2. From Bamboo 9.3.4 and 9.4.0, you can increase the value of ORPHANED_BUILD_MONITOR_JOB_SCHEDULER_REACTION_DELAY_MULTIPLIER. To do so, add the -Dbamboo.orphaned.build.monitor.reaction.delay.multiplier system property to Bamboo with a value between 3 and 20.
        The multiplier is applied in the following formula, shown here with its default value of 2:
        • heartbeatTimeoutSeconds (600s) + ( orphaned.build.monitor.reaction.delay.multiplier (2) * heartbeatTimeout (60s) ) = 720 seconds
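
As a worked example of the formula, the sketch below computes the resulting limit for a few multiplier values and builds the matching JVM argument. The function and constant names are made up for illustration; the 600s and 60s figures, the default multiplier of 2, the property name, and the 3 to 20 range are the ones quoted in this issue.

```python
# Worked arithmetic for the formula above. Names are illustrative only.
HEARTBEAT_TIMEOUT_SECONDS = 600  # "heartbeatTimeoutSeconds" term in the formula
HEARTBEAT_SECONDS = 60           # "heartbeatTimeout" term in the formula
DEFAULT_MULTIPLIER = 2

PROPERTY_NAME = "bamboo.orphaned.build.monitor.reaction.delay.multiplier"


def orphaned_build_limit(multiplier: int = DEFAULT_MULTIPLIER) -> int:
    """Seconds a build may stay queued before it is treated as orphaned."""
    return HEARTBEAT_TIMEOUT_SECONDS + multiplier * HEARTBEAT_SECONDS


def jvm_argument(multiplier: int) -> str:
    """Build the -D argument, enforcing the 3..20 range given in the workaround."""
    if not 3 <= multiplier <= 20:
        raise ValueError("multiplier should be between 3 and 20")
    return f"-D{PROPERTY_NAME}={multiplier}"


if __name__ == "__main__":
    for multiplier in (2, 3, 10, 20):
        print(multiplier, orphaned_build_limit(multiplier))
    print(jvm_argument(10))
```

With the default multiplier this gives 720 seconds; a multiplier of 10 gives 1200 seconds and the maximum of 20 gives 1800 seconds.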

      Notes


            Renata Dornelas made changes -
            Remote Link Original: This issue links to "Page (Atlassian Documentation)" [ 825549 ]
            Eduardo Alvarenga (Inactive) made changes -
            Description: Workaround step 2 changed from "Increase the number of ORPHANED_BUILD_MONITOR_JOB_SCHEDULER_REACTION_DELAY_MULTIPLIER" to "From Bamboo 9.3.4 and 9.4.0, you can increase the number of ORPHANED_BUILD_MONITOR_JOB_SCHEDULER_REACTION_DELAY_MULTIPLIER"; the rest of the description is unchanged.
            Eduardo Alvarenga (Inactive) made changes -
            Fix Version/s Original: 9.2.7 [ 106156 ]
            Eduardo Alvarenga (Inactive) made changes -
            Description: the link target for ORPHANED_BUILD_MONITOR_JOB_SCHEDULER_REACTION_DELAY_MULTIPLIER in Workaround step 2 changed from a text-fragment URL to https://docs.atlassian.com/atlassian-bamboo/latest/com/atlassian/bamboo/utils/SystemProperty.html#ORPHANED_BUILD_MONITOR_JOB_SCHEDULER_REACTION_DELAY_MULTIPLIER; the rest of the description is unchanged.
            Shashank Kumar made changes -
            Remote Link Original: This issue links to "Page (Confluence)" [ 851371 ]
            Shashank Kumar made changes -
            Remote Link New: This issue links to "Page (Confluence)" [ 851371 ]
            Wioletta Dys made changes -
            Remote Link New: This issue links to "Page (Confluence)" [ 849854 ]
            Alexey Chystoprudov made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Waiting for Release [ 12075 ] New: Closed [ 6 ]
            Eduardo Alvarenga (Inactive) made changes -
            Remote Link New: This issue links to "Page (Atlassian Documentation)" [ 825549 ]
            Mateusz Szmal made changes -
            Fix Version/s New: 9.3.4 [ 105764 ]
            Fix Version/s New: 9.4.0 [ 105148 ]
            Fix Version/s New: 9.2.7 [ 106156 ]

              Assignee: Mateusz Szmal
              Reporter: Eduardo Alvarenga (Inactive)
              Affected customers: 0
              Watchers: 3
