Mesh does not yet support self-healing after disaster recovery. When Mesh nodes are recovered from snapshots taken at different times, repositories that received one or more writes in the time window spanned by those snapshot times may have inconsistent replicas on different nodes. Some writes could be captured in the snapshot of one Mesh node but not in that of another, leaving the replicas in an inconsistent state. Repositories that did not receive writes in the "snapshot window" that the Mesh nodes were restored from are unaffected by this issue.
Symptoms
If the repository is in such an inconsistent state, the following problems may occur:
- Git operations, such as cloning a repository or listing a repository's refs, can return different results depending on which backing Mesh node services the request, until the inconsistent replica(s) are repaired.
- A successful write (e.g. a push to the repository) will trigger the repository to be repaired to the state that the majority of the replicas have, even if the minority has a newer version of the repository. As a result, a write that was only captured in the newest of the snapshots may be lost.
- Writes may fail, and continue to fail, if all three replicas captured a different state of the repository.
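The majority behaviour behind these symptoms can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Mesh's actual implementation: the function `majority_state`, the node names, and the state identifiers (imagine a digest over all refs of a replica) are hypothetical and chosen purely for illustration.

```python
from collections import Counter
from typing import Dict, Optional


def majority_state(replica_states: Dict[str, str]) -> Optional[str]:
    """Return the state held by a strict majority of replicas, or None if there is none."""
    if not replica_states:
        return None
    # most_common(1) yields the single most frequent state and its count.
    state, count = Counter(replica_states.values()).most_common(1)[0]
    return state if count > len(replica_states) / 2 else None


# Hypothetical example: two nodes were restored from older snapshots,
# while one node's snapshot also captured a newer push.
two_agree = {"mesh-1": "state-old", "mesh-2": "state-old", "mesh-3": "state-new"}

# Hypothetical example: every snapshot captured a different state.
all_differ = {"mesh-1": "state-a", "mesh-2": "state-b", "mesh-3": "state-c"}

print(majority_state(two_agree))   # state-old
print(majority_state(all_differ))  # None
```

In the first case the majority state wins, so the newer write held only by `mesh-3` would be discarded on repair. In the second case no majority exists, which corresponds to the situation where writes keep failing.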
Workarounds
If a repository is suspected to be in an inconsistent state because a push is failing with the error "Ref update could not be replicated" or "error: remote unpack failed: unpack-objects abnormal exit", try to push a new branch or tag to the repository. This push may still fail, but will trigger the outdated replicas to be marked as inconsistent and subsequently repaired.
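The workaround can be scripted from a local clone. The following is a minimal sketch, assuming the affected repository is configured as the remote `origin`; the branch name `mesh-repair-probe` and both helper functions are hypothetical. It uses only the standard `git push <remote> HEAD:refs/heads/<branch>` form to create a new branch on the remote.

```python
import subprocess
from typing import List


def build_probe_push(remote: str = "origin", branch: str = "mesh-repair-probe") -> List[str]:
    """Build a git command that pushes the current HEAD to a new branch on the remote."""
    return ["git", "push", remote, f"HEAD:refs/heads/{branch}"]


def run_probe_push(remote: str = "origin", branch: str = "mesh-repair-probe") -> bool:
    """Run the push in the current working directory and report whether it succeeded."""
    result = subprocess.run(
        build_probe_push(remote, branch), capture_output=True, text=True
    )
    if result.returncode != 0:
        # The push itself may still fail; print git's error output for reference.
        print(result.stderr.strip())
    return result.returncode == 0
```

Calling `run_probe_push()` from inside a clone of the affected repository attempts the push. A return value of `False` does not necessarily mean the workaround had no effect, since a failed push can still cause the outdated replicas to be marked as inconsistent.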
Planned improvements
The planned improvement is to enhance Bitbucket Server to scan for such inconsistencies automatically upon recovery. For any detected inconsistencies, the outdated replica(s) would be marked as inconsistent to prevent requests from being sent to them, and repairs would be scheduled.