  1. Bitbucket Data Center
  2. BSERV-14260

Mesh: Repository replica repair failing to catch up


    • Type: Suggestion
    • Resolution: Unresolved
    • Component: Mesh
    • We collect Bitbucket feedback from various sources, and we evaluate what we've collected when planning our product roadmap. To understand how this piece of feedback will be reviewed, see our Implementation of New Features Policy.

      Mesh delivers scalability and high availability by maintaining multiple replicas of each repository. Any write operation, such as a push, is replicated to each of the replicas to keep them in sync. Consistency is maintained by only allowing write operations to complete if they are successfully replicated to the majority of nodes, and only allowing nodes that have an up-to-date replica to service requests for the repository in question.
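The majority rule described above can be illustrated with a small model. This is only a sketch and not Mesh's actual implementation: the `Replica` class, the `replicate_write` and `can_serve` functions, and the integer `version` standing in for repository state are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    online: bool = True
    consistent: bool = True   # False once this replica has missed a write
    version: int = 0          # hypothetical stand-in for the repository state


def can_serve(replica: Replica) -> bool:
    """Only an online, up-to-date replica may service requests."""
    return replica.online and replica.consistent


def replicate_write(replicas: list[Replica], new_version: int) -> bool:
    """Apply a write if a majority of replicas can take part in it."""
    able = [can_serve(r) for r in replicas]
    if sum(able) <= len(replicas) // 2:
        return False  # no majority: the write is rejected, nothing changes
    for replica, ok in zip(replicas, able):
        if ok:
            replica.version = new_version
        else:
            replica.consistent = False  # missed the write, must be repaired
    return True
```

With three replicas, a write still completes when one node is offline, but that node stops being eligible to serve the repository. If a second node goes away, no majority remains and the write is rejected.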

      Should a node be unavailable (e.g. offline) during a write but the write can still complete successfully, that node's replica is marked as inconsistent. When the node becomes available once more, the replica will be automatically repaired. The time required for this repair depends on the number of refs and reflogs in the repository, and the number and size of objects that need to be fetched to bring the replica back in sync.
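To show why repair time grows with the number of refs and the volume of missing objects, here is a deliberately simplified model of a 'full' repair. It is an assumption-laden sketch, not how Mesh repairs a replica: refs are a plain dictionary of ref name to object id, each ref points at a single object rather than a commit graph, reflogs are ignored, and `full_repair` is a hypothetical name.

```python
def full_repair(source_refs, source_objects, replica_refs, replica_objects):
    """Bring a replica in line with an up-to-date source.

    Returns (refs_compared, bytes_fetched) as a rough measure of the work done.
    """
    refs_compared = 0
    bytes_fetched = 0
    for ref, object_id in source_refs.items():
        refs_compared += 1  # every ref is visited, changed or not
        if object_id not in replica_objects:
            data = source_objects[object_id]
            replica_objects[object_id] = data  # fetch the missing object
            bytes_fetched += len(data)
        replica_refs[ref] = object_id
    # Remove refs that were deleted on the source while the replica was away.
    for ref in list(replica_refs):
        if ref not in source_refs:
            del replica_refs[ref]
    return refs_compared, bytes_fetched
```

Even when only one branch has moved, this model still walks every ref on the source, which is the cost that a repair based on replaying missed writes would avoid.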

      If another write completes while the replica is being repaired, the replica still won't have fully caught up after the repair process completes and a new repair attempt will be started. In very active repositories with a high rate of change, this may lead to an out-of-date replica continuously repairing but not fully catching up for a long time. During this time the out-of-date replica won't be able to service requests for the repository, leaving the remaining replicas to service all requests for that repository, thereby leading to reduced capacity and resilience for that repository.
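The catch-up problem can be made concrete with a toy timeline. The sketch below assumes that every repair attempt takes the same fixed time and that any write landing during an attempt forces another attempt. Both are simplifications, and `repair_attempts_until_caught_up` is a hypothetical helper, not part of Bitbucket.

```python
def repair_attempts_until_caught_up(repair_seconds, write_times, max_attempts=1000):
    """Count repair attempts until one finishes with no write landing during it.

    write_times holds the timestamps (in seconds) at which writes complete on
    the other replicas. Returns None if max_attempts is exhausted first.
    """
    now = 0.0
    for attempt in range(1, max_attempts + 1):
        end = now + repair_seconds
        interrupted = any(now < t <= end for t in write_times)
        if not interrupted:
            return attempt  # the replica is fully caught up
        now = end  # a write slipped in: start the next attempt immediately
    return None


# Ten pushes, one every 10 seconds, starting at t = 5.
pushes = list(range(5, 100, 10))
print(repair_attempts_until_caught_up(30, pushes))  # 5
print(repair_attempts_until_caught_up(4, pushes))   # 1
```

In this model a 30 second repair only succeeds once the pushes stop, while a 4 second repair fits between pushes and succeeds at once. That is why either shortening the repair or letting writes wait briefly for it can break the cycle.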

      Two complementary improvements are being considered:

      • Improve repair performance by supporting incremental repair: record the write operations that the replica failed to apply, and replay those instead of doing a 'full' repair from an up-to-date replica.
      • When replicating a write operation to a replica that is being repaired, wait for up to X seconds for the repair to complete before concluding that the replica can't participate in the write.
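Both proposals can be sketched together. The following is a hypothetical illustration, not a design taken from this issue: `ReplicaState`, `record_missed`, `incremental_repair` and `can_participate` are invented names, writes are reduced to integer versions, and `wait_seconds` plays the role of the "X seconds" mentioned above.

```python
import threading
from collections import deque


class ReplicaState:
    def __init__(self):
        self.version = 0
        self.missed = deque()  # writes this replica failed to apply, oldest first
        self.caught_up = threading.Event()
        self.caught_up.set()
        self._lock = threading.Lock()

    def record_missed(self, version):
        """Remember a write the replica could not apply (first proposal)."""
        with self._lock:
            self.missed.append(version)
            self.caught_up.clear()

    def incremental_repair(self):
        """Replay only the missed writes instead of doing a full repair."""
        replayed = 0
        with self._lock:
            while self.missed:
                self.version = self.missed.popleft()
                replayed += 1
            self.caught_up.set()
        return replayed


def can_participate(replica, wait_seconds):
    """Wait up to wait_seconds for a repair to finish (second proposal)."""
    return replica.caught_up.wait(timeout=wait_seconds)
```

A caller replicating a new write would check `can_participate(replica, wait_seconds)` first and treat the replica as unavailable for that write only if the call returns `False`. The lock ensures that a missed write recorded during a repair is either replayed by that repair or leaves the replica marked as not caught up.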

            Assignee: Unassigned
            Reporter: Wolfgang Kritzinger (wkritzinger)