Type: Bug
Resolution: Fixed
Priority: Medium
Fix Version/s: 7.3.0, 7.4.0, 7.2.8
Affects Version/s: 7.02
Votes: 69
Severity: 2 - Major
UIS: 14
Summary
The LexoRank rebalancing job may get stuck in some cases, making no progress for many hours or even days.
You can check the status on the LexoRank management page (see LexoRank Management).
- Example output of the REST call /rest/greenhopper/1.0/lexorank/balance:
{
  "lexoRankBalancingServiceStatus": {
    "balancingDisabled": false,
    "balancingSuspended": false,
    "balanceHandlerRunning": true
  },
  "lexoRankBalancerStatus": {
    "balancerLocked": true,
    "perFieldStatus": [
      {
        "fieldName": "Rank",
        "fieldId": 11080,
        "numRankedIssues": 2846758,
        "percentComplete": 98.12,
        "distribution": [2793228, 0, 53530]
      }
    ]
  },
  "totalIssueCount": 2846034
}
- Note that progress is stuck at 98%.
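The same endpoint can be polled programmatically to confirm that percentComplete is not advancing. A minimal sketch using Java 11's HttpClient; the base URL and credentials are placeholders, and the stuck-detection is a crude whole-payload comparison:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class LexoRankBalanceProbe {
    public static void main(String[] args) throws Exception {
        // Placeholders: point these at your own instance and credentials.
        String baseUrl = "https://jira.example.com";
        String basicAuth = Base64.getEncoder().encodeToString("admin:admin".getBytes());

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/rest/greenhopper/1.0/lexorank/balance"))
                .header("Authorization", "Basic " + basicAuth)
                .header("Accept", "application/json")
                .GET()
                .build();

        String previous = null;
        while (true) {
            String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
            System.out.println(body);
            // Crude stuck-detection: identical payloads (including percentComplete)
            // across polls several minutes apart suggest rebalancing is not advancing.
            if (body.equals(previous)) {
                System.out.println("No change since the last poll - rebalancing may be stuck");
            }
            previous = body;
            Thread.sleep(5 * 60 * 1000L); // poll every 5 minutes
        }
    }
}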
- Please note that JSWSERVER-15703 introduces a back-off mechanism for JIRA Data Center, so it is expected that LexoRank rebalancing pauses under high load: whenever the current node is detected to be 30 seconds behind any other node in the cluster, or any other node is 30 seconds behind the current node, LexoRank balancing will back off. A sketch of this rule follows below.
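As a rough illustration of that back-off rule (a sketch only; the class and method names are hypothetical, not Atlassian's implementation):

import java.time.Duration;
import java.time.Instant;
import java.util.Collection;

// Hypothetical names; illustrates the back-off rule only, not Atlassian's code.
public class ClusterBackOffRule {
    private static final Duration MAX_DRIFT = Duration.ofSeconds(30);

    // Back off when this node and any other node in the cluster disagree about
    // replication progress by more than 30 seconds, in either direction.
    static boolean shouldBackOff(Instant currentNode, Collection<Instant> otherNodes) {
        return otherNodes.stream().anyMatch(
                other -> Duration.between(currentNode, other).abs().compareTo(MAX_DRIFT) > 0);
    }
}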
Expected Results
The LexoRank rebalancing job is running and making progress.
Actual Results
The LexoRank rebalancing job is stuck and makes no progress.
User impact: LexoRank rebalancing repairs issues with a "broken" Rank; until rebalancing reaches those issues, users cannot rank them.
Notes
- A thread dump generated while the job is stuck shows lexorank-executor-thread-0 WAITING for a long time:
"lexorank-executor-thread-0" #544 prio=5 os_prio=0 tid=0x00007f331c11a800 nid=0xaff5 waiting on condition [0x00007f3236dca000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0x00007f3706dffcc8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
	at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:350)
	at com.atlassian.greenhopper.service.lexorank.balance.LexoRankBalancer.balanceFieldId(LexoRankBalancer.java:363)
	at com.atlassian.greenhopper.service.lexorank.balance.LexoRankBalancer.lambda$balanceFieldIds$1(LexoRankBalancer.java:112)
	at com.atlassian.greenhopper.service.lexorank.balance.LexoRankBalancer.dt_access$782(LexoRankBalancer.java)
	at com.atlassian.greenhopper.service.lexorank.balance.LexoRankBalancer$$dtt$$Lambda$1518/708308671.run(Unknown Source)
	at com.atlassian.greenhopper.global.PerformanceLogger$EmptyPerformanceLogger.measure(PerformanceLogger.java:129)
	at com.atlassian.greenhopper.service.lexorank.balance.LexoRankBalancer.balanceFieldIds(LexoRankBalancer.java:109)
	at com.atlassian.greenhopper.service.lexorank.balance.LexoRankScheduledBalanceHandler.run(LexoRankScheduledBalanceHandler.java:60)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
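The parked frame at LinkedBlockingQueue.put is the key detail: put() on a bounded queue blocks with no timeout. A minimal reproduction of that hang (illustrative names only; the 64-slot capacity matches the reindex queue described in the Atlassian update further below):

import java.util.concurrent.LinkedBlockingQueue;

// Illustrative reproduction of the parked thread above; not Atlassian's code.
public class FullQueueHang {
    public static void main(String[] args) throws InterruptedException {
        LinkedBlockingQueue<Long> reindexQueue = new LinkedBlockingQueue<>(64);
        for (long issueId = 1; issueId <= 65; issueId++) {
            // With no consumer draining the queue (e.g. a failed reindex task),
            // the 65th put() parks forever in ConditionObject.await(), exactly
            // the WAITING frame shown in the dump at LinkedBlockingQueue.put.
            reindexQueue.put(issueId);
        }
        System.out.println("never reached");
    }
}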
- Enabling DEBUG logging for LexoRank balancing (com.atlassian.greenhopper.service.lexorank.balance) does not show any progress either.
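Assuming the stock Jira log4j.properties format, DEBUG logging for this package can be enabled with an entry like the following (the console and filelog appender names are Jira's defaults; adjust to your installation). Alternatively, the package can be added at runtime under Administration > System > Logging and Profiling.

log4j.logger.com.atlassian.greenhopper.service.lexorank.balance = DEBUG, console, filelog
log4j.additivity.com.atlassian.greenhopper.service.lexorank.balance = false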
- Expected logging:
2017-08-25 12:34:24,234 lexorank-executor-thread-0 DEBUG [c.a.g.s.lexorank.balance.LexoRankBalancer] Balancing next rank row for rank field [id=10400]
2017-08-25 12:34:24,234 lexorank-executor-thread-0 DEBUG [c.a.g.s.lexorank.balance.LexoRankBalanceOperation] Balancing next rank row to migrate from bucket[0] to bucket[1] for rank field[id=10400]
2017-08-25 12:34:24,235 lexorank-executor-thread-0 DEBUG [c.a.g.s.lexorank.balance.LexoRankBalanceOperation] Fetched rows at balance boundary
2017-08-25 12:34:24,235 lexorank-executor-thread-0 DEBUG [c.a.g.s.lexorank.balance.LexoRankBalanceOperation] Fetched rows at balance boundary
2017-08-25 12:34:24,235 lexorank-executor-thread-0 DEBUG [c.a.g.s.lexorank.balance.LexoRankBalanceOperation] rowToMigrate : LexoRankRow{id=9174, fieldId=10400, issueId=20443, lockHash='null', lockTime=null, bucket=0, rank='0|i01itb:', type=ISSUE_RANK_ROW}
2017-08-25 12:34:24,235 lexorank-executor-thread-0 DEBUG [c.a.g.s.lexorank.balance.LexoRankBalanceOperation] rowLastMigrated : LexoRankRow{id=9175, fieldId=10400, issueId=20444, lockHash='null', lockTime=null, bucket=1, rank='1|hzzstz:', type=ISSUE_RANK_ROW}
2017-08-25 12:34:24,241 lexorank-executor-thread-0 DEBUG [c.a.g.s.lexorank.balance.LexoRankBalanceOperation] Acquired lock on rows
2017-08-25 12:34:24,242 lexorank-executor-thread-0 DEBUG [c.a.g.s.lexorank.balance.LexoRankBalanceOperation] Balancing rank row [type=ISSUE_RANK_ROW, oldRank=0|i01itb:, newRank=1|hzzstr:]
2017-08-25 12:34:24,248 lexorank-executor-thread-0 DEBUG [c.a.g.s.lexorank.balance.LexoRankBalanceOperation] Successfully saved rank row
2017-08-25 12:34:24,248 lexorank-executor-thread-0 DEBUG [c.a.g.s.lexorank.balance.LexoRankBalanceOperation] Releasing lock
2017-08-25 12:34:24,255 lexorank-executor-thread-0 DEBUG [c.a.g.s.lexorank.balance.LexoRankBalancer] ReIndexing issue[id=20443]
- See the KB article Troubleshooting New Ranking System Issues for more details on rebalancing.
Workaround
Restart JIRA
is duplicated by:
- JSWSERVER-16072 LexoRank can freeze when it fails to finish IndexingRunnable (Closed)

is related to:
- JSWSERVER-16542 During rebalancing ranking operation fails to save new rank due to slow DB operation (Closed)
- JSWSERVER-26104 Rebalance can fail because of issue with reindexing (Gathering Impact)

relates to:
- JSWSERVER-20691 Lexorank affects issue creation time (Closed)
- JSWSERVER-26104 Rebalance can fail because of issue with reindexing (Gathering Impact)
- JSWSERVER-26105 Rebalancing errors are silenced and not propagated to UI (Gathering Impact)
- JSWDC-21
[JSWSERVER-16057] LexoRank rebalancing may get stuck in some cases
Resolution | New: Fixed [ 1 ] |
Status | Original: Waiting for Release [ 12075 ] | New: Closed [ 6 ] |
UIS | Original: 28 | New: 14 |
UIS | Original: 42 | New: 28 |
UIS | Original: 41 | New: 42 |
UIS | Original: 42 | New: 41 |
Current Status |
New:
{panel:title=Atlassian Update – 16 September 2024|borderStyle=solid|borderColor=#deebff|titleBGColor=#deebff|bgColor=#deebff}
Thank you for reporting this issue. We have been working on fixing it over the last few weeks. We're happy to announce that this issue is fixed in the +10.2.0+ release.

*Summary of the problem:* During rebalancing, a new issue rank needs to be reindexed. This operation is done asynchronously. Both processes share a queue of issues to be reindexed: after an issue is rebalanced, it is put into the queue and then picked up by the reindex process. The queue has a maximum limit of 64 issues. If the reindex process {*}fails{*}, the queue fills up after some time. Rebalance operations then used to wait an *infinite* amount of time to put a new item into an already full queue that was never consumed because of the reindex failure.

On top of that, errors/exceptions in the reindex process were never propagated or logged to the log files, so any reindexing errors were lost and we could not trace why or how reindexing could fail.

*New behaviour after the change:*

1. The rebalancing thread no longer waits an *infinite* amount of time to put a new item into the queue. We added a timeout of *5* seconds per issue (a total of *320* seconds for the whole queue). If the rebalancing process hits the timeout, it fails with the error {_}Indexing Queue for rebalanced issues overflow, reindex stuck or is taking longer than expected > PT320s{_}. There is a small possibility that reindexing *1* issue takes longer than *5* seconds; this means the rebalancing process is much faster than reindexing ({_}time_to_rebalance <<<< time_to_reindex{_}), and rebalancing could fail by hitting the timeout. In such a scenario, you can change the {_}lexorank.rank.balancer.time.to.finish.reindexing.in.seconds{_} property. Remember that this value is a timeout *per 1* issue: if you change the property to +10s+, the rebalancing process will effectively wait *640s* for reindexing ({_}property_timeout * size_of_queue = 10s * 64 = 640s{_}). Changing the rebalancing timeout should be done *rarely or never*, and only as a *last resort*. We *+encourage+* you to investigate the reindexing time instead; in our opinion, taking longer than 5s to reindex one issue indicates an +unhealthy+ instance.

2. Additionally, errors/exceptions in the reindex process are now propagated and properly logged. In case of errors, you can observe logs such as {_}Indexing failed for rebalanced issues. Errors:{_} or {_}LexoRank balancer failed,{_} followed by the remaining part of the error. As we were not able to identify why reindexing was failing (the errors were hidden), we have created a new bug. If you encounter rebalancing failing with the errors described above, please contact our support or share your errors on this bug → [JSWSERVER-26104|https://jira.atlassian.com/browse/JSWSERVER-26104]

3. Be aware that during our investigation we found an issue with the LexoRank management tab: rebalancing errors are not propagated to the UI. More about that in this bug → [JSWSERVER-26105|https://jira.atlassian.com/browse/JSWSERVER-26105]

Best regards,
Daniel Dudziak
Principal Software Engineer
{panel} |
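The change described in the panel amounts to replacing a blocking put with a time-bounded offer. A minimal sketch of that pattern (illustrative only; everything except the property name and the 64-slot queue size is hypothetical, not Atlassian's actual code):

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of the fix pattern described above; not Atlassian's code.
public class BoundedReindexQueue {
    private static final int QUEUE_CAPACITY = 64; // limit described in the update

    // Timeout per issue; the update names the property
    // lexorank.rank.balancer.time.to.finish.reindexing.in.seconds (default 5).
    private final long perIssueTimeoutSeconds =
            Long.getLong("lexorank.rank.balancer.time.to.finish.reindexing.in.seconds", 5L);

    private final LinkedBlockingQueue<Long> queue = new LinkedBlockingQueue<>(QUEUE_CAPACITY);

    void enqueueForReindex(long issueId) throws InterruptedException {
        // Before the fix: queue.put(issueId) - blocks forever on a full queue.
        // After the fix: offer(...) with a timeout, then fail loudly instead of hanging.
        if (!queue.offer(issueId, perIssueTimeoutSeconds, TimeUnit.SECONDS)) {
            throw new IllegalStateException(
                    "Indexing Queue for rebalanced issues overflow, reindex stuck or is taking"
                    + " longer than expected > PT" + (perIssueTimeoutSeconds * QUEUE_CAPACITY) + "S");
        }
    }
}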
Status | Original: In Progress [ 3 ] | New: Waiting for Release [ 12075 ] |
Fix Version/s | New: 10.2.0 [ 108819 ] |
Link | New: This issue relates to JSWSERVER-26105 [ JSWSERVER-26105 ] |