
[BCLOUD-22851] Self-hosted runners do not always clear the docker mount directory

    • Type: Bug
    • Resolution: Unresolved
    • Priority: High
    • Component: Pipelines - Runners

      Issue Summary

      • Occasionally, users report receiving the following error when executing their runner builds:
        bash: docker: command not found

      • This error occurs when the local Runner directory /tmp/<runner_uuid>/ already contains an empty docker folder.
      • We are not yet sure why the Runner directory sometimes already contains an empty docker folder; the current theory is that an unsuccessful runner configuration failed to clean up after itself.
      • Because the docker folder already exists, the Runner fails to mount the new docker directory, so the docker binary is not present during the build (a detection sketch follows this list).
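
      A minimal detection sketch, assuming the standard dockerized runner setup in which /tmp is shared between the host and the runner container; the UUID value is a placeholder.

        # Check whether a stale, empty docker folder was left behind for this runner.
        RUNNER_UUID="aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"   # placeholder: your runner's UUID
        DOCKER_DIR="/tmp/$RUNNER_UUID/docker"
        if [ -d "$DOCKER_DIR" ] && [ -z "$(ls -A "$DOCKER_DIR")" ]; then
            echo "Stale empty docker folder at $DOCKER_DIR; the docker mount will fail on the next build"
        fi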

      This is reproducible on Data Center: no

      Steps to Reproduce

      1. Execute a build within a dockerized Self-hosted Runner
      2. Observe the error described above (the failure is intermittent and hard to reproduce)

      Expected Results

      • The build executes successfully

      Actual Results

      • The build fails because the docker directory cannot be mounted and the docker binary is not available

      Workaround

      • Delete the empty docker folder from the runner's /tmp/<runner_uuid>/ directory and restart the affected runner (see the comments below); the docker binary is then mounted correctly on the next build. A command sketch follows.
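
      A hedged sketch of that manual remediation, run on the runner host; the UUID and container name are placeholders for your own setup.

        # Stop the affected runner, remove the leftover empty folder, then restart.
        RUNNER_UUID="aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"   # placeholder: the affected runner's UUID
        RUNNER_CONTAINER="runner-$RUNNER_UUID"               # placeholder: whatever name the runner container was given
        docker stop "$RUNNER_CONTAINER"
        rmdir "/tmp/$RUNNER_UUID/docker"                     # rmdir only succeeds if the folder really is empty
        docker start "$RUNNER_CONTAINER"                     # the docker binary should be mounted again on the next build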

      Comments

            Radu Cristescu added a comment - edited

            Today, two of my self-hosted runners started showing this error for pipe: trigger-pipeline calls. It was all out of the blue. Both of them left behind an empty docker directory around the same time: 15:57 and 16:00 GMT.

            Both runners are sharing the same server, which has 4 runners on it. I'm guessing that if I had enough load, all 4 of them would have shown the symptom.

            I deleted both empty docker directories, reran the pipeline, and the pipe failed again, exactly the same way, leaving behind an empty docker directory.

            Restarting the runners after deleting the empty directory fixes the issue, and I see an executable file called docker in there.

            Do you know what this looks like? It looks like what happens when I run docker run -v /file/that/does/not/exist:/dest, which creates /file/that/does/not/exist as a directory on the host and mounts it as such.
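
            The bind-mount behaviour described above can be confirmed with an arbitrary path (this is only an illustration; the paths are placeholders):

                docker run --rm -v /tmp/does-not-exist:/dest alpine ls -ld /dest
                ls -ld /tmp/does-not-exist   # the daemon created it on the host as an empty directory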


            Tom Emerson added a comment -

            Our self-hosted runner continues to experience this failure; it varies between a couple of times a week and every day when we have very active projects approaching deadlines.

            Surprised it has not been solved yet.


            Dan Milman added a comment -

            This is a major issue for us; it constantly causes failed builds that require manual remediation.
            Please fix!


            Alex Figliolia added a comment -

            Jumping in here to add to the frustration. This is a major issue for us and constantly causes failed builds that require manual remediation, on servers that only a few people can access.


            Curtis added a comment -

            One thing we've noticed regarding the failures is that they seem to occur when a Pipeline with a Pipe integration is executed. Our Runners that execute Pipelines with no Pipe configuration are all running fine.


            Ben added a comment -

            This is causing a big issue for our organisation. We have many agents that only a few people are allowed to administer. When this occurs, the fix, whilst straightforward, takes a long time to roll out to all agents.


            Bojan Kopanja added a comment -

            I can confirm that this issue occurs relatively frequently and is quite frustrating when it happens, as not everyone has access to runners. As a result, the entire company’s pipeline remains blocked until a manual fix is applied. Describing the severity as ‘minor’ is, in my opinion, an understatement.


            Dejan Čabrilo added a comment -

            I would like to appeal to reclassify symptom severity to something much higher than minor. While there is a workaround, it's extremely frustrating to users and it requires pager duty to get a runner unstuck, blocking the deploy process while it gets manually fixed. It happens relatively frequently (once every 40-50 builds in our experience so far).


            Jefferson Fermo added a comment -

            Hi, as I mentioned here https://community.atlassian.com/t5/Bitbucket-questions/Bitbucket-self-hosted-runner-failed-to-create-shim-task-OCI/qaq-p/2459381#U2685295, this issue usually happens at the most unfortunate times. Do we expect tier 1 support to log in to runners that usually deal with escalated, privileged automation scripts? I wouldn't consider this low priority, as it forces us to compromise security just to be able to use this feature.


            Franz QT added a comment -

            Any update on this? This is really frustrating and is affecting teams across our organization. 


              Assignee: Unassigned
              Reporter: Ben
              Affected customers: 32
              Watchers: 30