When a repository fork is created, Bitbucket runs the command git clone --bare --shared.... By default, commands it executes are subject to the following two timeouts:
- An idle timeout, which limits how long the command may run without producing any output. By default, a process that has not produced any output (on stdout/stderr) or consumed any input (on stdin) for 60 seconds will be terminated.
- An execution timeout, which sets a hard upper limit on how long the command may run even if it is producing output. By default, a process will be terminated after 120 seconds.
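As a sketch, these defaults correspond to the following bitbucket.properties settings (values in seconds):

```properties
# bitbucket.properties (defaults shown)
process.timeout.idle=60
process.timeout.execution=120
```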
In the case of fork creation, a clone that runs too long would be expected to be terminated after (by default) 120 seconds. However, because the command produces no output while it is processing certain things (e.g. lots of refs), the system assumes the process is idle after 60 seconds and terminates it.
To fix this, we should look at skipping idle detection for the fork command and relying solely on the execution timeout. The command would then be terminated after (by default) 120 seconds rather than 60 seconds.
This problem was identified on a customer system where fork creation was slow due to a large number of tags and branches (~30,000). The problem is exacerbated by filesystem performance, and will be more common with NFS-based storage, where I/O latencies are typically between 200µs and 1ms.
- Host Bitbucket on slow storage (e.g. use a slow NFS server - AWS EFS is a good option to reproduce this)
- Create a repository with many (e.g. 30,000) branches or tags
- Use the Bitbucket UI to fork the repository
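As a sketch, a repository with enough refs to trigger this can be generated locally (the repository name, tag names, and identity settings below are illustrative; git update-ref --stdin is used so that ref creation itself stays fast):

```shell
# Build a repository with ~30,000 tags to reproduce the slow fork.
git init demo-many-refs
cd demo-many-refs
git -c user.name=demo -c user.email=demo@example.com \
    commit --allow-empty -m "seed commit"
SHA=$(git rev-parse HEAD)
# Create all tags in a single git process so setup itself is quick
i=0
while [ "$i" -lt 30000 ]; do
    echo "create refs/tags/load-test-$i $SHA"
    i=$((i + 1))
done | git update-ref --stdin
git tag | wc -l    # confirm the tag count
```

Cloning this repository with git clone --bare --shared from slow storage is then expected to exceed the 60-second idle timeout.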
I've tested this on 6.10 (currently the oldest supported version); however, the issue goes back further than that.
Expected: Fork process will be terminated after 120 seconds if it has not completed.
Actual: Fork process will be terminated after 60 seconds if it has not completed.
In the bitbucket.properties file, increase the idle timeout (process.timeout.idle) to match the process execution timeout (process.timeout.execution).
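A sketch of that workaround, assuming the default 120-second execution timeout is kept (values in seconds):

```properties
# bitbucket.properties
# Raise the idle timeout (default 60) to match the execution timeout (default 120)
process.timeout.idle=120
```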
Somewhat unrelated to the problem this bug describes, but in cases where fork creation takes longer than 120 seconds it may also be necessary to increase process.timeout.execution, for example:
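A sketch with illustrative values only (in seconds; pick limits appropriate to the environment):

```properties
# bitbucket.properties — illustrative values only
process.timeout.idle=300
process.timeout.execution=300
```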
This should, however, not be set arbitrarily high: it is a system protection mechanism, designed to protect the system from unexpectedly long-running processes consuming resources. In the scenario described above, with large numbers of refs and slow NFS, a much better solution is to improve NFS performance and/or decrease the number of refs by deleting some branches or tags.