  Bamboo Data Center / BAM-22001

Bamboo cannot correctly handle parallel processing of build results of Jobs within the same Stage

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Medium
    • Fix Version/s: 9.1.0, 9.0.2
    • Affects Version/s: 8.2.5, 8.2.6, 8.1.10, 9.0.1
    • Component/s: Agents, Builds, Labels, Stages
    • Labels: None

    Description

      Problem

      Bamboo cannot correctly handle the parallel processing of build results of Jobs within the same Stage.

      For example, suppose a Stage contains Job1 and Job2. If Bamboo starts processing the build results for Job2 while Job1's results are still being processed, Bamboo sets Job2 back to an InProgress state. Job1's results finish successfully, but Job2 is left in an inconsistent state.

      Environment

      • Bamboo 9 (possibly reproducible on other versions)
      • Multiple parallel builds with large logs
      • At least 2 Remote agents (to reproduce the issue)

      Steps to Reproduce

      1. Have a Bamboo 9.0 instance with two remote agents (make sure to disable any local agents)
      2. Set expensive message processing threads to 4 in setenv.sh:
        -Dbamboo.max.concurrent.expensive.messages=4
        
      3. Turn on concurrent builds at Bamboo Administration >> Concurrent Builds and set the limit to 100
      4. Create a simple plan with a job with 1 script task
        The script downloads a dictionary if needed, then prints 16 lines of 30k+ characters each:
        #!/bin/bash
        for i in $(seq 1 16) ; do 
          if [ ! -f words.txt ] ; then
            curl -o words.txt "https://gist.githubusercontent.com/wchargin/8927565/raw/d9783627c731268fb2935a731a618aa8e95cf465/words"
          fi
          shuf -n 3000 words.txt | awk '{print}' ORS=' '
          echo -n artists
          echo
        done
        
      5. In Job Configuration >> Other, enable Pattern Match Labelling
        • Make sure the regex you use produces a match. For example, the script above ends every output line with "artists", so I set the regex to
          .*artists.*
        • Add a label. I used "jira", but I don't think the label value is significant.
      6. Clone this job so you now have a second job that also does this.
      7. Use a script to spam the build queue with this plan we created:
        #!/bin/bash
        for i in {1..10} ; do
          curl -k -X POST -H 'Authorization: Bearer <user_PAT_key>' \
            'https://<Bamboo_URL>/rest/api/latest/queue/TP-TBP'
        done
        
        • Fill in the plan key and PAT (Personal Access Token)
        • Run the script a couple of times in quick succession; each run queues 10 builds. Any web benchmarking or load-testing tool that can generate many calls quickly should also work.
      8. Monitor your database with this SQL:
        SELECT * FROM buildresultsummary WHERE BUILD_KEY LIKE 'PLAN-KEY%' AND BUILD_STATE = 'Unknown' and life_cycle_state = 'InProgress';
        
        • Replace PLAN-KEY with the key of the plan you created.
      9. Eventually, rows for jobs stuck in "InProgress" should accumulate. If not, run the script again; I'm usually able to get 3-5 per run.
      10. Wait about 15 minutes after the builds are complete for the ones that remain in an "InProgress" state to be marked as orphaned. For example:
        simple	26-Oct-2022 19:04:04	Finished building AA-AB-JOB1-471.
        error	26-Oct-2022 19:17:33	Build AA-AB-JOB1-471 had to be cancelled: it was marked as in progress in DB but Bamboo has no record of this build.
        
        2022-10-26 19:16:33,969 INFO [scheduler_Worker-5] [OrphanedBuildMonitorJob] AA-AB-JOB1-453 marked as InProgress but not present in CBC since Wed Oct 26 19:03:34 AEDT 2022
        2022-10-26 19:16:33,969 ERROR [scheduler_Worker-5] [OrphanedBuildMonitorJob] Build AA-AB-JOB1-453 had to be cancelled: it was marked as in progress in DB but Bamboo has no record of this build.
        2022-10-26 19:16:33,993 INFO [scheduler_Worker-5] [OrphanedBuildMonitorJob] AA-AB-D2-445 marked as InProgress but not present in CBC since Wed Oct 26 19:03:34 AEDT 2022
        2022-10-26 19:16:33,994 ERROR [scheduler_Worker-5] [OrphanedBuildMonitorJob] Build AA-AB-D2-445 had to be cancelled: it was marked as in progress in DB but Bamboo has no record of this build.
        2022-10-26 19:16:34,011 INFO [scheduler_Worker-5] [OrphanedBuildMonitorJob] AA-AB-JOB1-425 marked as InProgress but not present in CBC since Wed Oct 26 19:03:34 AEDT 2022
        2022-10-26 19:16:34,011 ERROR [scheduler_Worker-5] [OrphanedBuildMonitorJob] Build AA-AB-JOB1-425 had to be cancelled: it was marked as in progress in DB but Bamboo has no record of this build.
        2022-10-26 19:17:33,966 INFO [scheduler_Worker-2] [OrphanedBuildMonitorJob] AA-AB-JOB1-471 marked as InProgress but not present in CBC since Wed Oct 26 19:04:33 AEDT 2022
        2022-10-26 19:17:33,966 ERROR [scheduler_Worker-2] [OrphanedBuildMonitorJob] Build AA-AB-JOB1-471 had to be cancelled: it was marked as in progress in DB but Bamboo has no record of this build.
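
To collect the affected result keys from a server log, the ERROR lines above can be filtered with standard tools. The following is a minimal sketch: the function name `orphaned_builds` is hypothetical, and the pattern is derived only from the log lines quoted in this ticket, so it may need adjusting for other log layouts.

```shell
#!/bin/bash
# Print the unique result keys that OrphanedBuildMonitorJob cancelled.
# Reads log text on stdin; the pattern matches the ERROR lines quoted above.
orphaned_builds() {
  grep -F '[OrphanedBuildMonitorJob] Build ' \
    | sed -n 's/.*\[OrphanedBuildMonitorJob\] Build \([A-Z0-9-]*\) had to be cancelled.*/\1/p' \
    | LC_ALL=C sort -u
}
```

Usage, assuming the server log is named `atlassian-bamboo.log` in the current directory: `orphaned_builds < atlassian-bamboo.log`.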

      Expected Results

      Concurrent builds from the same Stage should finish, and their results should be processed by Bamboo and not be changed afterwards.

      Actual Results

      There is a race condition that sets a Job's status back to InProgress when Bamboo processes the results of several Jobs from the same Stage of a Plan in parallel. This leaves an inconsistent state in the DB and forces Bamboo to cancel the Job via the OrphanedBuildMonitorJob a few minutes later.

      We set up a custom DB trigger that records every related insert/update. Its output shows that result processing succeeds, but something sets the build back to an InProgress state afterwards:

      2022-10-27 10:29:11.464905,1771182,Pending,TP-TBP-JOB2,Job2,680
      2022-10-27 10:29:11.464905,1771182,Pending,TP-TBP-JOB2,Job2,680
      2022-10-27 10:29:29.299236,1771182,Queued,TP-TBP-JOB2,Job2,680
      2022-10-27 10:29:32.311414,1771182,InProgress,TP-TBP-JOB2,Job2,680
      2022-10-27 10:29:33.770960,1771182,InProgress,TP-TBP-JOB2,Job2,680
      2022-10-27 10:29:34.203115,1771182,InProgress,TP-TBP-JOB2,Job2,680
      2022-10-27 10:29:34.934852,1771182,Finished,TP-TBP-JOB2,Job2,680
      2022-10-27 10:29:34.934852,1771182,Finished,TP-TBP-JOB2,Job2,680
      2022-10-27 10:29:35.075755,1771182,InProgress,TP-TBP-JOB2,Job2,680
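
A regression like the last row can be spotted mechanically in such a trace. The sketch below assumes the comma-separated layout shown above, with the timestamp in the first field, the life cycle state in the third, the job key in the fourth, and the build number in the sixth. These column meanings are inferred from the sample, and `find_regressions` is a hypothetical helper name.

```shell
#!/bin/bash
# Report results whose state moves from Finished back to InProgress.
# Input: CSV lines as in the trace above
# (timestamp = field 1, state = field 3, job key = field 4, build number = field 6).
find_regressions() {
  awk -F',' '
    { id = $4 "-" $6 }
    # A row that is InProgress right after the same result was Finished is the bug signature.
    $3 == "InProgress" && last[id] == "Finished" {
      print id " regressed to InProgress at " $1
    }
    { last[id] = $3 }
  '
}
```

Fed the trace above on stdin, this reports a single regression for `TP-TBP-JOB2-680`, at the final row.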

      This is a very rare corner case. A potential fix could add guards against the late state change, but that is neither obvious nor easy because of the re-run functionality.

      The events that set the InProgress state are sent from the agent. A Remote Agent is required because it adds latency between the results being sent and processed. For that reason, this bug is very hard to reproduce in a local, same-machine environment.

      Workarounds

      1. Remove/disable any Pattern Match Labelling used by the Jobs in the Plan
      2. Reduce the log noise from the builds
      3. Add a few seconds of sleep at the end of the task to give Bamboo time to process the log volume
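
For the third workaround, the delay can be wrapped in a small function and called as the last line of the script task. This is a minimal sketch: the function name, the variable `POST_BUILD_SLEEP_SECONDS`, and the 10-second default are assumptions, not values recommended in this ticket.

```shell
#!/bin/bash
# Workaround sketch: pause before the task exits so the server has time
# to process the log output. POST_BUILD_SLEEP_SECONDS is a hypothetical
# variable; tune the delay for your instance.
pause_for_result_processing() {
  local seconds="${POST_BUILD_SLEEP_SECONDS:-10}"
  echo "Sleeping ${seconds}s so Bamboo can finish processing the build log"
  sleep "$seconds"
}
```

Call `pause_for_result_processing` after the existing build commands in each affected Job.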

            People

              851f15845f55 Mateusz Szmal
              73868399605e Eduardo Alvarenga