  1. Jira Data Center
  2. JRASERVER-63842

LexoRank Rebalance causes read/write amplifications on Lucene in JIRA Datacenter


      Summary

      In JIRA Datacenter, a LexoRank rebalance causes read/write amplification on the Lucene index, which may degrade performance if the cluster lacks spare I/O or CPU capacity.

      Environment

      • JIRA Datacenter
      • Large number of issues: 1M+
      • Large number of custom fields: 1k+

      Steps to Reproduce

      1. Set up JIRA Datacenter
      2. Trigger or wait for LexoRank Rebalance

      Expected Results

      JIRA Datacenter performance is not affected and there is no replication lag.

      Actual Results

      JIRA Datacenter performance can be affected and index replication lags behind, which causes data discrepancies between nodes.
      The following health-check error is reported:

      ["Index replication for cluster node 'node3' is behind by 2,991 seconds.","Index replication for cluster node 'node1' is behind by 1,501 seconds.","Index replication for cluster node 'node2_0004' is behind by 2,123 seconds."]
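
      To track the lag per node, the health-check messages can be parsed programmatically. The following is a minimal Python sketch, assuming the payload is a JSON array of strings in the format of this report. The function name `parse_replication_lag` and the regular expression are illustrative and are not part of JIRA.

```python
import json
import re

# Health-check payload copied verbatim from this report.
HEALTH_CHECK = """["Index replication for cluster node 'node3' is behind by 2,991 seconds.","Index replication for cluster node 'node1' is behind by 1,501 seconds.","Index replication for cluster node 'node2_0004' is behind by 2,123 seconds."]"""

# Assumed message format: "... cluster node '<name>' is behind by <n> seconds."
LAG_PATTERN = re.compile(r"cluster node '([^']+)' is behind by ([\d,]+) seconds")


def parse_replication_lag(raw: str) -> dict[str, int]:
    """Return a mapping of node name to replication lag in seconds."""
    lags = {}
    for message in json.loads(raw):
        match = LAG_PATTERN.search(message)
        if match:
            node, seconds = match.groups()
            # Strip the thousands separator before converting.
            lags[node] = int(seconds.replace(",", ""))
    return lags


lags = parse_replication_lag(HEALTH_CHECK)
worst_node = max(lags, key=lags.get)
print(lags)
print("Most delayed node:", worst_node)
```

      For the payload in this report, the most delayed node is node3, at 2,991 seconds (roughly 50 minutes).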
      

      Notes

      The problem is caused by a combination of conditions:

      • LexoRank rebalancing requires rebalancing all records in the Rank field.
      • In JDC, that requires reindexing all issues together with all related custom fields and comments.
      • That causes read and write amplification, because every node needs to update its Lucene index for all issues.
      • JDC uses the same replicatedindexoperation mechanism for all updates.
        • That means that critical replication updates initiated by user actions on other nodes compete with non-urgent LexoRank updates.
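
      The effect of sharing one replication mechanism can be illustrated with a toy model. The sketch below is not JIRA's implementation: it assumes a single FIFO queue of index operations and contrasts it with a hypothetical two-level priority queue, purely to show why a user-initiated update can wait behind a large batch of rebalance updates. All function names are illustrative.

```python
import heapq
from collections import deque


def fifo_wait(rebalance_ops: int) -> int:
    """Operations processed before a user update that arrives after
    a rebalance batch in a single shared FIFO queue."""
    queue = deque(["rebalance"] * rebalance_ops)
    queue.append("user")
    processed = 0
    while queue.popleft() != "user":
        processed += 1
    return processed


def priority_wait(rebalance_ops: int) -> int:
    """Same scenario with a hypothetical priority queue
    (0 = user update, 1 = rebalance update)."""
    heap = [(1, seq, "rebalance") for seq in range(rebalance_ops)]
    heap.append((0, rebalance_ops, "user"))
    heapq.heapify(heap)
    processed = 0
    while heapq.heappop(heap)[2] != "user":
        processed += 1
    return processed


print("FIFO: user update waits behind", fifo_wait(1000), "operations")
print("Priority: user update waits behind", priority_wait(1000), "operations")
```

      In the FIFO model the user update is applied only after the entire rebalance batch, which matches the replication lag described in this report. The priority variant is shown only for contrast.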

      Workarounds

      • Reduce the number of nodes in the cluster. Testing reveals diminishing returns in performance in clusters larger than 4 nodes.
      • Avoid closing a sprint that has more than 200 issues in an unresolved status. Doing so requires a new Rank value for each of these issues and can trigger a LexoRank rebalance.
      • If you are planning to import hundreds of issues, delay the import until you have tested and resolved performance bottlenecks. A Rank must be generated for every issue, and this can trigger a LexoRank rebalance.
      • Leave only one node behind the load balancer (LB) to prevent stale data from being served by other nodes. This negates the high-availability value of Data Center, so it is considered a last resort. It also requires that each node can handle the full concurrent user traffic for your organization, as is best practice for an HA cluster.

      Full details on workarounds and solutions are available at JIRAKB/JIRA Software Data Center Lexorank Indexing Lag.

      Note on Fix

      Problem mitigation:

      • We have worked on reducing how often LexoRank balancing needs to be triggered (JSW-15710).
      • A number of improvements to LexoRank balancing have also been implemented that reduce its impact on the JIRA cluster.
        • We have addressed replication lag in JSW-15703.

      That said, a running LexoRank balancing will still cause some read and write amplification. More details about the resolution are available at JIRAKB/JIRA Software Data Center Lexorank Indexing Lag.
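
      As a rough illustration of the scale involved, the sketch below estimates the number of per-issue Lucene index updates for one full rebalance. It assumes, as described in the Notes, that every node updates its index for every issue. The figures (1M issues, up to 4 nodes) are taken from the Environment and Workarounds sections, and the function name `index_updates` is illustrative. Treat the result as an upper bound, not a measurement.

```python
def index_updates(issues: int, nodes: int) -> int:
    """Upper-bound count of per-issue index updates across the cluster,
    assuming every node reindexes every issue once."""
    return issues * nodes


ISSUES = 1_000_000  # "1M+" from the Environment section

for nodes in range(1, 5):
    total = index_updates(ISSUES, nodes)
    print(f"{nodes} node(s): {total:,} index updates")
```

      Under this assumption the total work grows linearly with the number of nodes, which is consistent with the workaround of reducing cluster size.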

              Unassigned
              ayakovlev@atlassian.com Andriy Yakovlev [Atlassian]
              Votes: 25
              Watchers: 56

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 1h