Jira Software Data Center / JSWSERVER-16057

LexoRank rebalancing may get stuck in some cases


    • 7.02
    • Severity 2 - Major
      Atlassian Update – 16 September 2024

      Thank you for reporting this issue. We have been working on a fix over the last few weeks, and we're happy to announce that this issue is fixed in the 10.2.0 release.

      Summary of the problem:

      During rebalancing, each newly ranked issue needs to be reindexed. Reindexing is done asynchronously: both processes share a queue of issues to be reindexed. After an issue is rebalanced, it is put into the queue and then picked up by the reindex process. The queue holds at most 64 issues. If the reindex process fails, the queue eventually fills up, and the rebalance operation then waited indefinitely to put a new item into an already full queue that was never consumed.
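      The producer side of this stall can be reproduced with a minimal, self-contained sketch. The class and variable names below are hypothetical illustrations, not Jira's actual internals; only the 64-slot bounded queue and the blocking put() semantics come from the description above:

      ```java
      import java.util.concurrent.LinkedBlockingQueue;

      // Sketch of the pre-fix failure mode: the rebalancer produces issue
      // ids into a 64-slot bounded queue; if the reindex consumer dies
      // silently, the queue fills up and the next blocking put() would
      // park the producer thread forever.
      public class ReindexQueueStallSketch {
          static final int QUEUE_CAPACITY = 64;

          public static void main(String[] args) {
              LinkedBlockingQueue<Long> reindexQueue =
                      new LinkedBlockingQueue<>(QUEUE_CAPACITY);

              // Fill the queue as if the consumer had stopped taking items.
              for (long issueId = 1; issueId <= QUEUE_CAPACITY; issueId++) {
                  reindexQueue.offer(issueId);
              }

              // A non-blocking offer() shows the full queue rejects a 65th
              // item; the original code used put(), which blocks here forever.
              System.out.println("65th item accepted: " + reindexQueue.offer(65L)); // prints false
          }
      }
      ```

      This is exactly the state captured by the thread dump in the Notes section, where the executor thread is parked inside LinkedBlockingQueue.put.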

      On top of that, errors and exceptions in the reindex process were never propagated or written to the log files. As a result, any reindexing errors were lost, and we could not trace why or how it was failing.

      New behaviour after the change:

      1. The rebalancing thread no longer waits indefinitely to put a new item into the queue. We added a timeout of 5 seconds per issue (320 seconds in total for the whole 64-slot queue). If the rebalancing process hits the timeout, it fails with the error: `Indexing Queue for rebalanced issues overflow, reindex stuck or is taking longer than expected > PT320s`.

      There is a small possibility that reindexing a single issue takes longer than 5 seconds on your instance. That would mean the rebalancing process is much faster than reindexing (time_to_rebalance <<<< time_to_reindex), in which case rebalancing could fail by hitting the timeout. In such a scenario, you can change the `lexorank.rank.balancer.time.to.finish.reindexing.in.seconds` property. Remember that the value is a timeout per single issue: if you change the property to 10 seconds, the rebalancing process will effectively wait up to 640 seconds for reindexing (property_timeout * size_of_queue = 10 s * 64 = 640 s).
      Changing the rebalancing timeout should rarely, if ever, be needed, and only as a last resort. We encourage you to investigate the reindexing time instead: taking longer than 5 seconds to reindex a single issue, in our opinion, indicates an unhealthy instance.
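      The change described in point 1 can be sketched as swapping the unbounded put() for a timed offer(). The class and method names below are illustrative only; the timeout arithmetic matches the numbers above (the actual Jira error message renders the duration as `PT320s`):

      ```java
      import java.time.Duration;
      import java.util.concurrent.LinkedBlockingQueue;
      import java.util.concurrent.TimeUnit;

      // Sketch of the post-fix enqueue path (hypothetical names): a timed
      // offer() replaces the blocking put(), and the per-issue timeout is
      // read from a configurable property.
      public class BoundedEnqueueSketch {
          static final int QUEUE_CAPACITY = 64;

          // Worst-case total wait: per-issue timeout times queue capacity.
          static Duration totalTimeout(long perIssueTimeoutSeconds) {
              return Duration.ofSeconds(perIssueTimeoutSeconds * QUEUE_CAPACITY);
          }

          static void enqueue(LinkedBlockingQueue<Long> queue, long issueId,
                              long perIssueTimeoutSeconds) throws InterruptedException {
              if (!queue.offer(issueId, perIssueTimeoutSeconds, TimeUnit.SECONDS)) {
                  throw new IllegalStateException(
                          "Indexing Queue for rebalanced issues overflow, reindex stuck"
                          + " or is taking longer than expected > "
                          + totalTimeout(perIssueTimeoutSeconds));
              }
          }

          public static void main(String[] args) {
              System.out.println(totalTimeout(5));  // default 5 s/issue -> PT5M20S (320 s)
              System.out.println(totalTimeout(10)); // raised to 10 s/issue -> PT10M40S (640 s)
          }
      }
      ```

      With this shape, a dead reindex consumer turns a permanent hang into a bounded wait followed by a loggable failure.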

      2. Additionally, errors and exceptions in the reindex process are now propagated and properly logged. In case of errors, you may see log lines such as:

      • `Indexing failed for rebalanced issues. Errors:` + remaining part of the error.
      • `LexoRank balancer failed,` + remaining part of the error.

      As we were not able to identify why reindexing was failing (the errors were previously hidden), we have created a new bug. If you encounter rebalancing failing with the errors described above, please contact our support or share your errors on this bug → JSWSERVER-26104

      3. Be aware that during our investigation we also found an issue with the LexoRank management tab: no rebalancing errors are propagated to the UI. More about that in this bug → JSWSERVER-26105

      Best regards.

      Daniel Dudziak
      Principal Software Engineer


      Summary

      The LexoRank rebalancing job may get stuck in some cases: you see no progress for many hours (or days).
      You can check the status on the LexoRank management page (see LexoRank Management).

      • Example of REST call /rest/greenhopper/1.0/lexorank/balance output:
        {
          "lexoRankBalancingServiceStatus": {
            "balancingDisabled": false,
            "balancingSuspended": false,
            "balanceHandlerRunning": true
          },
          "lexoRankBalancerStatus": {
            "balancerLocked": true,
            "perFieldStatus": [
              {
                "fieldName": "Rank",
                "fieldId": 11080,
                "numRankedIssues": 2846758,
                "percentComplete": 98.12,
                "distribution": [2793228, 0, 53530]
              }
            ]
          },
          "totalIssueCount": 2846034
        }
        
      • Note the progress stuck at 98.12% (per the distribution, 2,793,228 of 2,846,758 rank rows migrated).
      • Please note that JSWSERVER-15703 introduced a back-off mechanism for Jira DC, so it is expected that LexoRank rebalancing pauses under high load:

        Whenever we detect current node is behind any other node in the cluster by 30 seconds or any other node in the cluster is behind current node by 30 seconds, LexoRank balancing will back-off.
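        The back-off rule quoted above can be sketched as follows. This is a simplified, hypothetical model (the method and field names are not Jira's internals): balancing backs off whenever any two nodes' index-replication timestamps drift apart by 30 seconds or more, in either direction.

        ```java
        // Simplified sketch of the JSWSERVER-15703 back-off decision
        // (hypothetical names): compare index timestamps across all
        // cluster nodes and pause balancing on >= 30 s drift.
        public class BalancerBackoffSketch {
            static final long MAX_DRIFT_MILLIS = 30_000;

            static boolean shouldBackOff(long[] nodeIndexTimestampsMillis) {
                long min = Long.MAX_VALUE;
                long max = Long.MIN_VALUE;
                for (long t : nodeIndexTimestampsMillis) {
                    min = Math.min(min, t);
                    max = Math.max(max, t);
                }
                // Covers both directions: current node behind others,
                // or others behind the current node.
                return (max - min) >= MAX_DRIFT_MILLIS;
            }

            public static void main(String[] args) {
                System.out.println(shouldBackOff(new long[]{1_000, 5_000}));  // prints false (4 s drift)
                System.out.println(shouldBackOff(new long[]{1_000, 40_000})); // prints true (39 s drift)
            }
        }
        ```

        The practical consequence for diagnosis: a paused balancer under sustained index-replication lag is expected behaviour, distinct from the permanent stall this bug describes.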

      Expected Results

      The LexoRank rebalancing job is running and making progress.

      Actual Results

      The LexoRank rebalancing job is stuck and makes no progress.
      User impact: LexoRank rebalancing repairs issues with a "broken" Rank; until the current run fixes those broken issues, users will not be able to rank them.

      Notes

      • A thread dump generated at that time will show that the lexorank-executor-thread-0 thread has been WAITING for a long time:
        "lexorank-executor-thread-0" #544 prio=5 os_prio=0 tid=0x00007f331c11a800 nid=0xaff5 waiting on condition [0x00007f3236dca000]
           java.lang.Thread.State: WAITING (parking)
        	at sun.misc.Unsafe.park(Native Method)
        	- parking to wait for  <0x00007f3706dffcc8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        	at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:350)
        	at com.atlassian.greenhopper.service.lexorank.balance.LexoRankBalancer.balanceFieldId(LexoRankBalancer.java:363)
        	at com.atlassian.greenhopper.service.lexorank.balance.LexoRankBalancer.lambda$balanceFieldIds$1(LexoRankBalancer.java:112)
        	at com.atlassian.greenhopper.service.lexorank.balance.LexoRankBalancer.dt_access$782(LexoRankBalancer.java)
        	at com.atlassian.greenhopper.service.lexorank.balance.LexoRankBalancer$$dtt$$Lambda$1518/708308671.run(Unknown Source)
        	at com.atlassian.greenhopper.global.PerformanceLogger$EmptyPerformanceLogger.measure(PerformanceLogger.java:129)
        	at com.atlassian.greenhopper.service.lexorank.balance.LexoRankBalancer.balanceFieldIds(LexoRankBalancer.java:109)
        	at com.atlassian.greenhopper.service.lexorank.balance.LexoRankScheduledBalanceHandler.run(LexoRankScheduledBalanceHandler.java:60)
        	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        	at java.lang.Thread.run(Thread.java:745)
        
      • Enabling DEBUG logging for the LexoRank balancer (com.atlassian.greenhopper.service.lexorank.balance) doesn't show any progress
        • Expected logging:
          2017-08-25 12:34:24,234 lexorank-executor-thread-0 DEBUG      [c.a.g.s.lexorank.balance.LexoRankBalancer] Balancing next rank row for rank field [id=10400]
          2017-08-25 12:34:24,234 lexorank-executor-thread-0 DEBUG      [c.a.g.s.lexorank.balance.LexoRankBalanceOperation] Balancing next rank row to migrate from bucket[0] to bucket[1] for rank field[id=10400]
          2017-08-25 12:34:24,235 lexorank-executor-thread-0 DEBUG      [c.a.g.s.lexorank.balance.LexoRankBalanceOperation] Fetched rows at balance boundary
          2017-08-25 12:34:24,235 lexorank-executor-thread-0 DEBUG      [c.a.g.s.lexorank.balance.LexoRankBalanceOperation] Fetched rows at balance boundary
          2017-08-25 12:34:24,235 lexorank-executor-thread-0 DEBUG      [c.a.g.s.lexorank.balance.LexoRankBalanceOperation]       rowToMigrate : LexoRankRow{id=9174, fieldId=10400, issueId=20443, lockHash='null', lockTime=null, bucket=0, rank='0|i01itb:', type=ISSUE_RANK_ROW}
          2017-08-25 12:34:24,235 lexorank-executor-thread-0 DEBUG      [c.a.g.s.lexorank.balance.LexoRankBalanceOperation]       rowLastMigrated : LexoRankRow{id=9175, fieldId=10400, issueId=20444, lockHash='null', lockTime=null, bucket=1, rank='1|hzzstz:', type=ISSUE_RANK_ROW}
          2017-08-25 12:34:24,241 lexorank-executor-thread-0 DEBUG      [c.a.g.s.lexorank.balance.LexoRankBalanceOperation] Acquired lock on rows
          2017-08-25 12:34:24,242 lexorank-executor-thread-0 DEBUG      [c.a.g.s.lexorank.balance.LexoRankBalanceOperation] Balancing rank row [type=ISSUE_RANK_ROW, oldRank=0|i01itb:, newRank=1|hzzstr:]
          2017-08-25 12:34:24,248 lexorank-executor-thread-0 DEBUG      [c.a.g.s.lexorank.balance.LexoRankBalanceOperation] Successfully saved rank row
          2017-08-25 12:34:24,248 lexorank-executor-thread-0 DEBUG      [c.a.g.s.lexorank.balance.LexoRankBalanceOperation] Releasing lock
          2017-08-25 12:34:24,255 lexorank-executor-thread-0 DEBUG      [c.a.g.s.lexorank.balance.LexoRankBalancer] ReIndexing issue[id=20443]
          
      • See KB for more details regarding rebalancing: Troubleshooting New Ranking System Issues

      Workaround

      Restart JIRA

              ddudziak Stasiu
              ayakovlev@atlassian.com Andriy Yakovlev [Atlassian]