  1. Bitbucket Data Center
  2. BSERV-13270

Mesh: Support Mesh nodes being deployed to multiple Availability Zones


    • Type: Suggestion
    • Resolution: Unresolved
    • None
    • Mesh
    • None
    • 150
    • We collect Bitbucket feedback from various sources, and we evaluate what we've collected when planning our product roadmap. To understand how this piece of feedback will be reviewed, see our Implementation of New Features Policy.

      For the initial (Bitbucket 8.0) release of Mesh, we will not support deploying Mesh nodes into multiple availability zones. As with NFS-based deployments, the Mesh nodes (i.e. the repository storage) and the application nodes must be co-located. Support for deploying Mesh nodes (and application nodes) across multiple availability zones should be added in order to increase availability.

      Fundamentally, support for this comes in two parts:

      • Ensuring that the additional latency incurred when communicating between availability zones is tolerated; and
      • Making replica placement availability-zone aware, so that replicas are distributed across availability zones in a way that results in the desired availability.

      Update September 2024

      We are happy to confirm that the additional latency incurred when Mesh nodes are deployed in multiple AWS availability zones is acceptable. As such, deploying Mesh nodes across multiple AWS availability zones (but not across regions) is a supported configuration. Other cloud environments have not been tested, but most cloud providers have the concepts of an availability zone and a region (sometimes under different terminology). We'd expect their analogue of availability zones to be viable too and, again, cross-region deployments not to be viable. As a rough guideline, if the round-trip latency (e.g. measured by ping) between the Mesh nodes and the nodes running the core Bitbucket application is under 5ms, we would expect that to be a viable configuration as well.

      The above statement applies to all versions of Bitbucket Mesh, going back to Bitbucket 8.0.

      It is also possible to run the core Bitbucket application nodes in availability zones that differ from each other and from those of the Mesh nodes. However, this is only permitted once all repositories have been migrated off NFS storage and onto Mesh nodes. If any repositories remain on NFS, all core Bitbucket application nodes must be hosted in the same availability zone as the NFS server.

      That leaves fault tolerance and replica placement as the topic of this ticket. It is possible to manually place nodes such that fault tolerance is achieved. Specifically, nodes must be distributed between availability zones so that the loss of one availability zone leaves a sufficient number of replicas for a write to succeed. Writes succeed on a quorum of replicas: where “n” is the replication factor, a quorum of (n/2 + 1) replicas must be available, with the result of the division rounded down. For example, with a replication factor of three, a minimum of two replicas must be present for a write to succeed. In a simple scenario with a replication factor of three and three Mesh nodes, each in a separate availability zone, it’s easy to see that an outage in one availability zone would still leave two replicas available.
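
      The quorum rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Mesh code: the function names and the zone labels (az-a, az-b, az-c) are hypothetical, and the check considers the loss of a single availability zone only.

```python
from collections import Counter


def write_quorum(replication_factor: int) -> int:
    """Quorum of n/2 + 1 replicas, with the division rounded down."""
    return replication_factor // 2 + 1


def survives_zone_outage(replica_zones: list[str]) -> bool:
    """Return True if writes still reach quorum after losing any one zone.

    replica_zones lists the availability zone of each replica of one
    repository, so its length is the replication factor.
    """
    quorum = write_quorum(len(replica_zones))
    # The worst case is losing the zone that holds the most replicas.
    worst_loss = max(Counter(replica_zones).values())
    return len(replica_zones) - worst_loss >= quorum


print(write_quorum(3))                                  # 2
print(survives_zone_outage(["az-a", "az-b", "az-c"]))   # True
print(survives_zone_outage(["az-a", "az-a", "az-b"]))   # False
```

      In the last call, losing az-a leaves one replica, which is below the quorum of two, so writes would fail even though one copy of the repository is still present.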

      However, consider 9 Mesh nodes, equally distributed over 3 availability zones, with a replication factor of 3. In this case all three replicas of a repository could be placed in the same availability zone. A complete outage of that availability zone would render the repository completely inaccessible.
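
      To make the 9-node case concrete, the sketch below enumerates every possible set of three nodes and classifies it by what remains after the worst single-zone outage. It assumes nothing about how Mesh actually chooses replicas (the counts are possibilities, not probabilities), and the node and zone names are hypothetical.

```python
from collections import Counter
from itertools import combinations

ZONES = ["az-a", "az-b", "az-c"]
# Nine hypothetical Mesh nodes, three per availability zone.
NODES = [(f"mesh-{zone}-{i}", zone) for zone in ZONES for i in range(3)]
REPLICATION_FACTOR = 3
QUORUM = REPLICATION_FACTOR // 2 + 1  # n/2 + 1, division rounded down


def classify(placement) -> str:
    """Classify a placement by its state after the worst single-zone outage."""
    per_zone = Counter(zone for _, zone in placement)
    remaining = REPLICATION_FACTOR - max(per_zone.values())
    if remaining >= QUORUM:
        return "writable"
    if remaining > 0:
        return "below quorum"
    return "no replicas left"


def summarize() -> Counter:
    """Count all possible 3-node placements by outcome."""
    return Counter(classify(p) for p in combinations(NODES, REPLICATION_FACTOR))


print(summarize())
```

      Of the 84 possible placements, only the 27 that put each replica in a different availability zone remain writable after any single-zone outage. 54 drop below quorum, and 3 place every replica in one zone and so can lose all replicas at once.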

      For a more detailed discussion of the above, with pictures, please see the Multiple availability zone deployments section of the Bitbucket Mesh Whitepaper.

      This ticket will remain open to track future improvements to Bitbucket Mesh that make replica placement availability-zone aware, so that replicas are distributed across availability zones in a way that lets the above 9-node deployment (and similar deployments) deliver fault tolerance.

              ckochovski@atlassian.com Christopher Kochovski
              behumphreys Ben Humphreys
              Votes: 23
              Watchers: 38
