Type: Bug
Resolution: Fixed
Priority: Medium
Fix Version/s: 7.3.0, 7.4.0, 7.2.8
Affects Version/s: 7.02
Votes: 69
Severity: 2 - Major
UIS: 14
Summary
The LexoRank rebalancing job may get stuck in some cases, making no progress for many hours or even days.
You can check the status on the LexoRank management page (see LexoRank Management).
- Example output of the REST call /rest/greenhopper/1.0/lexorank/balance:
{
  "lexoRankBalancingServiceStatus": {
    "balancingDisabled": false,
    "balancingSuspended": false,
    "balanceHandlerRunning": true
  },
  "lexoRankBalancerStatus": {
    "balancerLocked": true,
    "perFieldStatus": [
      {
        "fieldName": "Rank",
        "fieldId": 11080,
        "numRankedIssues": 2846758,
        "percentComplete": 98.12,
        "distribution": [2793228, 0, 53530]
      }
    ]
  },
  "totalIssueCount": 2846034
}
- Note that progress is stuck at 98%.
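The same endpoint can be polled programmatically to confirm that percentComplete is not advancing. A minimal sketch using Java 11's HttpClient; the base URL and credentials are placeholders, and the stuck-detection is a crude whole-payload comparison:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class LexoRankBalanceProbe {
    public static void main(String[] args) throws Exception {
        // Placeholders: point these at your own instance and credentials.
        String baseUrl = "https://jira.example.com";
        String basicAuth = Base64.getEncoder().encodeToString("admin:admin".getBytes());

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/rest/greenhopper/1.0/lexorank/balance"))
                .header("Authorization", "Basic " + basicAuth)
                .header("Accept", "application/json")
                .GET()
                .build();

        String previous = null;
        while (true) {
            String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
            System.out.println(body);
            // Crude stuck-detection: identical payloads (including percentComplete)
            // across polls several minutes apart suggest rebalancing is not advancing.
            if (body.equals(previous)) {
                System.out.println("No change since the last poll - rebalancing may be stuck");
            }
            previous = body;
            Thread.sleep(5 * 60 * 1000L); // poll every 5 minutes
        }
    }
}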
- Please note that JSWSERVER-15703 introduces a back-off mechanism for JIRA Data Center, so it is expected that LexoRank rebalancing pauses under high load: whenever the current node is detected to be 30 seconds behind any other node in the cluster, or any other node is 30 seconds behind the current node, LexoRank balancing will back off. A sketch of this rule follows below.
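As a rough illustration of that back-off rule (a sketch only; the class and method names are hypothetical, not Atlassian's implementation):

import java.time.Duration;
import java.time.Instant;
import java.util.Collection;

// Hypothetical names; illustrates the back-off rule only, not Atlassian's code.
public class ClusterBackOffRule {
    private static final Duration MAX_DRIFT = Duration.ofSeconds(30);

    // Back off when this node and any other node in the cluster disagree about
    // replication progress by more than 30 seconds, in either direction.
    static boolean shouldBackOff(Instant currentNode, Collection<Instant> otherNodes) {
        return otherNodes.stream().anyMatch(
                other -> Duration.between(currentNode, other).abs().compareTo(MAX_DRIFT) > 0);
    }
}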
Expected Results
The LexoRank rebalancing job is running and making progress.
Actual Results
The LexoRank rebalancing job is stuck and makes no progress.
User impact: LexoRank rebalancing repairs issues with a "broken" Rank; until rebalancing reaches those issues, users cannot rank them.
Notes
- A thread dump generated while the job is stuck shows lexorank-executor-thread-0 WAITING for a long time:
"lexorank-executor-thread-0" #544 prio=5 os_prio=0 tid=0x00007f331c11a800 nid=0xaff5 waiting on condition [0x00007f3236dca000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0x00007f3706dffcc8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
	at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:350)
	at com.atlassian.greenhopper.service.lexorank.balance.LexoRankBalancer.balanceFieldId(LexoRankBalancer.java:363)
	at com.atlassian.greenhopper.service.lexorank.balance.LexoRankBalancer.lambda$balanceFieldIds$1(LexoRankBalancer.java:112)
	at com.atlassian.greenhopper.service.lexorank.balance.LexoRankBalancer.dt_access$782(LexoRankBalancer.java)
	at com.atlassian.greenhopper.service.lexorank.balance.LexoRankBalancer$$dtt$$Lambda$1518/708308671.run(Unknown Source)
	at com.atlassian.greenhopper.global.PerformanceLogger$EmptyPerformanceLogger.measure(PerformanceLogger.java:129)
	at com.atlassian.greenhopper.service.lexorank.balance.LexoRankBalancer.balanceFieldIds(LexoRankBalancer.java:109)
	at com.atlassian.greenhopper.service.lexorank.balance.LexoRankScheduledBalanceHandler.run(LexoRankScheduledBalanceHandler.java:60)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
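The parked frame at LinkedBlockingQueue.put is the key detail: put() on a bounded queue blocks with no timeout. A minimal reproduction of that hang (illustrative names only; the 64-slot capacity matches the reindex queue described in the Atlassian update further below):

import java.util.concurrent.LinkedBlockingQueue;

// Illustrative reproduction of the parked thread above; not Atlassian's code.
public class FullQueueHang {
    public static void main(String[] args) throws InterruptedException {
        LinkedBlockingQueue<Long> reindexQueue = new LinkedBlockingQueue<>(64);
        for (long issueId = 1; issueId <= 65; issueId++) {
            // With no consumer draining the queue (e.g. a failed reindex task),
            // the 65th put() parks forever in ConditionObject.await(), exactly
            // the WAITING frame shown in the dump at LinkedBlockingQueue.put.
            reindexQueue.put(issueId);
        }
        System.out.println("never reached");
    }
}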
- Enabling DEBUG logging for LexoRank balancing (com.atlassian.greenhopper.service.lexorank.balance) does not show any progress either.
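Assuming the stock Jira log4j.properties format, DEBUG logging for this package can be enabled with an entry like the following (the console and filelog appender names are Jira's defaults; adjust to your installation). Alternatively, the package can be added at runtime under Administration > System > Logging and Profiling.

log4j.logger.com.atlassian.greenhopper.service.lexorank.balance = DEBUG, console, filelog
log4j.additivity.com.atlassian.greenhopper.service.lexorank.balance = false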
- Expected logging:
2017-08-25 12:34:24,234 lexorank-executor-thread-0 DEBUG [c.a.g.s.lexorank.balance.LexoRankBalancer] Balancing next rank row for rank field [id=10400]
2017-08-25 12:34:24,234 lexorank-executor-thread-0 DEBUG [c.a.g.s.lexorank.balance.LexoRankBalanceOperation] Balancing next rank row to migrate from bucket[0] to bucket[1] for rank field[id=10400]
2017-08-25 12:34:24,235 lexorank-executor-thread-0 DEBUG [c.a.g.s.lexorank.balance.LexoRankBalanceOperation] Fetched rows at balance boundary
2017-08-25 12:34:24,235 lexorank-executor-thread-0 DEBUG [c.a.g.s.lexorank.balance.LexoRankBalanceOperation] Fetched rows at balance boundary
2017-08-25 12:34:24,235 lexorank-executor-thread-0 DEBUG [c.a.g.s.lexorank.balance.LexoRankBalanceOperation] rowToMigrate : LexoRankRow{id=9174, fieldId=10400, issueId=20443, lockHash='null', lockTime=null, bucket=0, rank='0|i01itb:', type=ISSUE_RANK_ROW}
2017-08-25 12:34:24,235 lexorank-executor-thread-0 DEBUG [c.a.g.s.lexorank.balance.LexoRankBalanceOperation] rowLastMigrated : LexoRankRow{id=9175, fieldId=10400, issueId=20444, lockHash='null', lockTime=null, bucket=1, rank='1|hzzstz:', type=ISSUE_RANK_ROW}
2017-08-25 12:34:24,241 lexorank-executor-thread-0 DEBUG [c.a.g.s.lexorank.balance.LexoRankBalanceOperation] Acquired lock on rows
2017-08-25 12:34:24,242 lexorank-executor-thread-0 DEBUG [c.a.g.s.lexorank.balance.LexoRankBalanceOperation] Balancing rank row [type=ISSUE_RANK_ROW, oldRank=0|i01itb:, newRank=1|hzzstr:]
2017-08-25 12:34:24,248 lexorank-executor-thread-0 DEBUG [c.a.g.s.lexorank.balance.LexoRankBalanceOperation] Successfully saved rank row
2017-08-25 12:34:24,248 lexorank-executor-thread-0 DEBUG [c.a.g.s.lexorank.balance.LexoRankBalanceOperation] Releasing lock
2017-08-25 12:34:24,255 lexorank-executor-thread-0 DEBUG [c.a.g.s.lexorank.balance.LexoRankBalancer] ReIndexing issue[id=20443]
- See the KB article Troubleshooting New Ranking System Issues for more details on rebalancing.
Workaround
Restart JIRA
is duplicated by:
- JSWSERVER-16072 LexoRank can freeze when it fails to finish IndexingRunnable (Closed)

is related to:
- JSWSERVER-16542 During rebalancing ranking operation fails to save new rank due to slow DB operation (Closed)
- JSWSERVER-26104 Rebalance can fail because of issue with reindexing (Gathering Impact)

relates to:
- JSWSERVER-20691 Lexorank affects issue creation time (Closed)
- JSWSERVER-26104 Rebalance can fail because of issue with reindexing (Gathering Impact)
- JSWSERVER-26105 Rebalancing errors are silenced and not propagated to UI (Gathering Impact)
- JSWDC-21
[JSWSERVER-16057] LexoRank rebalancing may get stuck in some cases
Resolution | New: Fixed [ 1 ] |
Status | Original: Waiting for Release [ 12075 ] | New: Closed [ 6 ] |
UIS | Original: 28 | New: 14 |
UIS | Original: 42 | New: 28 |
UIS | Original: 41 | New: 42 |
UIS | Original: 42 | New: 41 |
Current Status |
New:
{panel:title=Atlassian Update – 16 September 2024|borderStyle=solid|borderColor=#deebff|titleBGColor=#deebff|bgColor=#deebff}
Thank you for reporting this issue. We have been working on fixing it over the last few weeks. We're happy to announce that this issue is fixed in the +10.2.0+ release.

*Summary of the problem:* During rebalancing, a new issue rank needs to be reindexed. This operation is done asynchronously. Both processes share a queue of issues to be reindexed: after an issue is rebalanced, it is put into the queue and then picked up by the reindex process. The queue has a maximum limit of 64 issues. If the reindex process {*}fails{*}, the queue fills up after some time. Rebalance operations then used to wait an *infinite* amount of time to put a new item into an already full queue that was never consumed because of the reindex failure.

On top of that, errors/exceptions in the reindex process were never propagated or logged to the log files, so any reindexing errors were lost and we could not trace why or how reindexing could fail.

*New behaviour after the change:*

1. The rebalancing thread no longer waits an *infinite* amount of time to put a new item into the queue. We added a timeout of *5* seconds per issue (a total of *320* seconds for the whole queue). If the rebalancing process hits the timeout, it fails with the error {_}Indexing Queue for rebalanced issues overflow, reindex stuck or is taking longer than expected > PT320s{_}. There is a small possibility that reindexing *1* issue takes longer than *5* seconds; this means the rebalancing process is much faster than reindexing ({_}time_to_rebalance <<<< time_to_reindex{_}), and rebalancing could fail by hitting the timeout. In such a scenario, you can change the {_}lexorank.rank.balancer.time.to.finish.reindexing.in.seconds{_} property. Remember that this value is a timeout *per 1* issue: if you change the property to +10s+, the rebalancing process will effectively wait *640s* for reindexing ({_}property_timeout * size_of_queue = 10s * 64 = 640s{_}). Changing the rebalancing timeout should be done *rarely or never*, and only as a *last resort*. We *+encourage+* you to investigate the reindexing time instead; in our opinion, taking longer than 5s to reindex one issue indicates an +unhealthy+ instance.

2. Additionally, errors/exceptions in the reindex process are now propagated and properly logged. In case of errors, you can observe logs such as {_}Indexing failed for rebalanced issues. Errors:{_} or {_}LexoRank balancer failed,{_} followed by the remaining part of the error. As we were not able to identify why reindexing was failing (the errors were hidden), we have created a new bug. If you encounter rebalancing failing with the errors described above, please contact our support or share your errors on this bug → [JSWSERVER-26104|https://jira.atlassian.com/browse/JSWSERVER-26104]

3. Be aware that during our investigation we found an issue with the LexoRank management tab: rebalancing errors are not propagated to the UI. More about that in this bug → [JSWSERVER-26105|https://jira.atlassian.com/browse/JSWSERVER-26105]

Best regards,
Daniel Dudziak
Principal Software Engineer
{panel} |
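The change described in the panel amounts to replacing a blocking put with a time-bounded offer. A minimal sketch of that pattern (illustrative only; everything except the property name and the 64-slot queue size is hypothetical, not Atlassian's actual code):

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of the fix pattern described above; not Atlassian's code.
public class BoundedReindexQueue {
    private static final int QUEUE_CAPACITY = 64; // limit described in the update

    // Timeout per issue; the update names the property
    // lexorank.rank.balancer.time.to.finish.reindexing.in.seconds (default 5).
    private final long perIssueTimeoutSeconds =
            Long.getLong("lexorank.rank.balancer.time.to.finish.reindexing.in.seconds", 5L);

    private final LinkedBlockingQueue<Long> queue = new LinkedBlockingQueue<>(QUEUE_CAPACITY);

    void enqueueForReindex(long issueId) throws InterruptedException {
        // Before the fix: queue.put(issueId) - blocks forever on a full queue.
        // After the fix: offer(...) with a timeout, then fail loudly instead of hanging.
        if (!queue.offer(issueId, perIssueTimeoutSeconds, TimeUnit.SECONDS)) {
            throw new IllegalStateException(
                    "Indexing Queue for rebalanced issues overflow, reindex stuck or is taking"
                    + " longer than expected > PT" + (perIssueTimeoutSeconds * QUEUE_CAPACITY) + "S");
        }
    }
}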
Status | Original: In Progress [ 3 ] | New: Waiting for Release [ 12075 ] |
Fix Version/s | New: 10.2.0 [ 108819 ] |
Link | New: This issue relates to JSWSERVER-26105 [ JSWSERVER-26105 ] |