Bitbucket Data Center / BSERV-13300

Mesh: support large, active repositories


Details

    • Type: Suggestion
    • Resolution: Fixed
    • Fix Version/s: 8.9.0
    • Component/s: Mesh
    • Labels: None
    • 5
    • We collect Bitbucket feedback from various sources, and we evaluate what we've collected when planning our product roadmap. To understand how this piece of feedback will be reviewed, see our Implementation of New Features Policy.

    Description

      Update July 18th 2023

      We have closed this issue as Mesh is now ready for large, active repositories. Under certain circumstances, repository replica repair can still take a long time to finish; this improvement is tracked in BSERV-14260.

      Kind regards,
      Wolfgang Kritzinger
      Bitbucket Data Center

      In the initial (Bitbucket 8.0) release of Mesh, large, active repositories (with more than 10,000 refs and/or larger than 4 GB in size) are not yet recommended.

      There are two areas that need to be improved to support such large, active repositories:

      Initial garbage collection after migration to Mesh

      Mesh performs garbage collection differently from Bitbucket. After repositories are migrated, Mesh garbage collection may unpack a significant number of unreachable objects to prepare for them to be pruned. This can significantly increase disk usage and inode usage and, in extreme cases, may exhaust the file system. Large, active repositories with long histories are especially likely to suffer from this issue.

      The Mesh migration process should be improved to only migrate reachable objects, thereby avoiding these garbage collection issues.
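
      As an illustration of the proposed approach, the sketch below uses JGit purely for illustration (it is not Mesh's actual migration code) to enumerate only the objects reachable from a repository's refs. Unreachable, dangling objects are never visited, so a migration built along these lines would not carry them over, and the post-migration garbage collection would have nothing extra to unpack and prune.

      import java.io.File;
      import java.io.IOException;

      import org.eclipse.jgit.api.Git;
      import org.eclipse.jgit.lib.ObjectId;
      import org.eclipse.jgit.lib.Ref;
      import org.eclipse.jgit.lib.Repository;
      import org.eclipse.jgit.revwalk.ObjectWalk;

      /**
       * Illustrative only: counts the objects reachable from the repository's refs,
       * i.e. the set a "reachable objects only" migration would need to copy.
       */
      public class ReachableObjectScan {

          public static long countReachableObjects(File gitDir) throws IOException {
              try (Git git = Git.open(gitDir);
                   ObjectWalk walk = new ObjectWalk(git.getRepository())) {
                  Repository repo = git.getRepository();
                  // Start the walk from every ref tip (branches, tags, etc.).
                  for (Ref ref : repo.getRefDatabase().getRefs()) {
                      ObjectId tip = ref.getObjectId();
                      if (tip != null) {
                          walk.markStart(walk.parseAny(tip));
                      }
                  }
                  long count = 0;
                  // Walk the reachable commits first...
                  while (walk.next() != null) {
                      count++;
                  }
                  // ...then the trees and blobs they reference.
                  while (walk.nextObject() != null) {
                      count++;
                  }
                  return count;
              }
          }
      }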

      Repository replica repair failing to catch up

      Mesh delivers scalability and high availability by maintaining multiple replicas of each repository. Any write operation, such as a push, is replicated to each of the replicas to keep them in sync. Consistency is maintained by only allowing write operations to complete if they are successfully replicated to a majority of nodes, and by only allowing nodes that have an up-to-date replica to service requests for the repository in question.
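
      The following is a minimal sketch of the majority-quorum write described above. The Replica interface and WriteOperation type are hypothetical stand-ins rather than Mesh's real API; the point is only that a write succeeds once a majority of replicas acknowledge it, and that a replica which misses the write is marked inconsistent.

      import java.util.List;
      import java.util.concurrent.CompletableFuture;
      import java.util.concurrent.atomic.AtomicInteger;

      public class QuorumWriter {

          /** Hypothetical replica abstraction, not Mesh's real interface. */
          public interface Replica {
              /** Applies the write; completes exceptionally if the node is unreachable. */
              CompletableFuture<Void> apply(WriteOperation op);
              /** Flags the replica so it is repaired before serving this repository again. */
              void markInconsistent();
          }

          public record WriteOperation(String repositoryId, byte[] payload) {}

          /** Returns true if a majority of replicas acknowledged the write. */
          public static boolean replicate(WriteOperation op, List<Replica> replicas) {
              int majority = replicas.size() / 2 + 1;
              AtomicInteger acks = new AtomicInteger();

              CompletableFuture<?>[] results = replicas.stream()
                      .map(replica -> replica.apply(op)
                              .thenRun(acks::incrementAndGet)
                              .exceptionally(failure -> {
                                  // This node missed the write: its copy is now stale.
                                  replica.markInconsistent();
                                  return null;
                              }))
                      .toArray(CompletableFuture[]::new);

              CompletableFuture.allOf(results).join();
              return acks.get() >= majority;
          }
      }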

      If a node is unavailable (e.g. offline) during a write but the write still completes successfully, that node's replica is marked as inconsistent. When the node becomes available again, the replica is automatically repaired. The time required for this repair depends on the number of refs and reflogs in the repository, and on the number and size of objects that need to be fetched to bring the replica back in sync.

      If another write completes while the replica is being repaired, the replica still won't have fully caught up once the repair finishes, and a new repair attempt is started. In highly active repositories with a high rate of change, this can leave an out-of-date replica repairing continuously without catching up for a long time. During this time the out-of-date replica cannot service requests for the repository, leaving the remaining replicas to handle all of its traffic and reducing capacity and resilience for that repository.
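
      A rough sketch of the catch-up problem (illustrative names, not Mesh's API): each repair copies whatever the replica was missing when the repair started, so writes that complete during the repair leave it out of date again and immediately trigger another attempt.

      public class ReplicaRepairLoop {

          /** Hypothetical view of a replica's repair state. */
          public interface ReplicaState {
              boolean isUpToDate();
              /** Full repair: fetch missing refs and objects from an up-to-date replica. */
              void repairFromUpToDateReplica();
          }

          public static void repairUntilCaughtUp(ReplicaState replica) {
              // In a very active repository new writes can land during every
              // iteration, so this loop may run for a long time; while it runs,
              // the replica cannot service requests for the repository.
              while (!replica.isUpToDate()) {
                  replica.repairFromUpToDateReplica();
              }
          }
      }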

      Two complementary improvements are being considered:

      • Improve repair performance by supporting incremental repair: record the write operations that the replica failed to apply, and replay those instead of doing a 'full' repair from an up-to-date replica.
      • When replicating a write operation to a replica that is being repaired, wait for up to X seconds for the repair to complete before concluding that the replica can't participate in the write.
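
      A minimal sketch of both ideas, again with hypothetical names rather than Mesh's real API: missed writes are queued and replayed incrementally instead of copying the whole repository, and a coordinator can wait a caller-supplied grace period (the 'X seconds' above) for an in-flight repair to finish before excluding the replica from a write.

      import java.time.Duration;
      import java.util.Queue;
      import java.util.concurrent.ConcurrentLinkedQueue;
      import java.util.concurrent.CountDownLatch;
      import java.util.concurrent.TimeUnit;

      public class IncrementalRepair {

          /** Hypothetical replayable form of a write operation. */
          public interface ReplayableWrite {
              void applyTo(String replicaId);
          }

          private final Queue<ReplayableWrite> missedWrites = new ConcurrentLinkedQueue<>();
          private final String replicaId;

          public IncrementalRepair(String replicaId) {
              this.replicaId = replicaId;
          }

          /** Called by the write coordinator whenever this replica misses a write. */
          public void recordMissedWrite(ReplayableWrite write) {
              missedWrites.add(write);
          }

          /** Replays only the missed writes instead of performing a full repair. */
          public void repairIncrementally() {
              ReplayableWrite write;
              while ((write = missedWrites.poll()) != null) {
                  write.applyTo(replicaId);
              }
          }

          /**
           * Waits up to the given grace period for an in-flight repair to finish so the
           * replica can still take part in the write; returns false if it timed out.
           */
          public static boolean awaitRepair(CountDownLatch repairFinished, Duration gracePeriod)
                  throws InterruptedException {
              return repairFinished.await(gracePeriod.toMillis(), TimeUnit.MILLISECONDS);
          }
      }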


    People

      Assignee: Unassigned
      Reporter: Michael Heemskerk (mheemskerk) (Inactive)
      Votes: 16
      Watchers: 19
