Details

Type: Bug
Resolution: Fixed
Priority: Low
Fix Version/s: 9.14.0, 9.12.5
Affects Version/s: 9.1.0, 9.6.0, 9.9.1
Component/s: Scrum Board, Sprint
Labels:
- pse-request

Introduced in Version:
9.01
Support reference count:
7
Symptom Severity:
Severity 2 - Major
UIS:
44
Bug Fix Policy:
View Atlassian Server bug fix policy

Description

Issue Summary

This is reproducible on Data Center: yes

During periods of high Sprint activity (create, close, open and reopen), these operations wait for one another to finish because they all depend on a single cache (sprintCache). Each of them also triggers a flush of that cache, which invalidates it on every node, leading to a scenario where the cache is constantly being refreshed.

On instances with a high number of Sprints in AO_60DB71_SPRINT, each cache refresh takes longer, and this cascades into threads stuck waiting for the refresh to finish. Combined with high activity, this may lead to thread exhaustion and very long delays when performing Sprint activities.

There are also JQL functions related to Sprint which need to wait for the sprintCache refreshes to finish; these functions are FutureSprints(), ClosedSprints() and OpenSprints().
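
The blocking behaviour described above can be illustrated with a minimal sketch. It is not Jira's implementation: the class WholeTableCacheSketch, its loader and its reload counter are hypothetical names chosen for illustration, and a single monitor lock stands in for whatever synchronization sprintCache actually uses.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical stand-in for a whole-table cache such as sprintCache; not Jira code. */
class WholeTableCacheSketch<T> {
    // Stands in for a full scan of a table such as AO_60DB71_SPRINT.
    private final Supplier<List<T>> tableLoader;
    // null means "flushed": the next reader has to reload everything.
    private List<T> snapshot;
    private int reloadCount;

    WholeTableCacheSketch(Supplier<List<T>> tableLoader) {
        this.tableLoader = tableLoader;
    }

    /** All readers share one lock, so they wait while one thread reloads. */
    synchronized List<T> getAll() {
        if (snapshot == null) {
            snapshot = Collections.unmodifiableList(new ArrayList<>(tableLoader.get()));
            reloadCount++;
        }
        return snapshot;
    }

    /** Every flush discards the whole snapshot, whatever actually changed. */
    synchronized void flushAll() {
        snapshot = null;
    }

    synchronized int getReloadCount() {
        return reloadCount;
    }
}
```

In this sketch, two consecutive getAll() calls cost one reload, while a flushAll() between them forces a second full reload. The more rows the loader returns, the longer every other caller of getAll() waits on the lock.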

Note:

For reference, two new caches (sprintCacheById and sprintCacheRapidViewIdToSprints) were added as part of "Performance of Jira can degrade significantly due to slow sprint cache population" (JSWSERVER-20618). The cache causing the constraint here is the original one, sprintCache, which contains the entire AO_60DB71_SPRINT table.

Steps to Reproduce

The scenario we're trying to reproduce is a very large instance on which users are performing their Sprint planning activities.

Create 90k sprints.
Perform operations that trigger cache flush of sprintCache; start several processes performing Sprint operations such as create, start, close or reopen.
At the same time, trigger JQL searches that use sprint-related functions (FutureSprints, ClosedSprints, OpenSprints).

Expected Results

The operations should finish in a reasonable amount of time, since they're fetching from a cache.

Actual Results

Each sprint operation that calls the updateSprint method triggers a cache flush. This covers only create, start, close and reopen; adding or removing issues from a sprint does not trigger it. Code from Jira 9.1+:

    private ServiceOutcome<Sprint> updateSprint(Sprint sprint, PartialUpdate partialUpdate) {
        final Long sprintId = sprint.getId();
        ServiceOutcome<SprintAO> existing = sprintDao.load(sprintId);

        if (!existing.isValid()) {
            return error(existing);
        }

        Sprint existingSprint = sprintAOMapper.toModel(existing.get());

        // flush AO cache
        sprintDao.flushAll();

        partialUpdate.apply(sprint, existing.getValue());
        sprintDao.save(existing.get());

        // reload record to try prevent stale record being in cache
        flushCacheEntriesRelatedToSprint(sprint);

        // It is possible that sprint has been moved between boards, in this case we need to invalidate also the previous board's entry.
        if (existingSprint.getRapidViewId() != null && !existingSprint.getRapidViewId().equals(sprint.getRapidViewId())) {
            rapidViewIdToSprintsCache.remove(existingSprint.getRapidViewId());
        }

Flushing the cache also sends a message to invalidate sprintCache on the other nodes. While a node is refreshing this cache, either after an updateSprint or because of an invalidation coming from another node, all similar operations wait for the cache to become available.

Thread dumps show that at nearly every snapshot one thread was running the sprintDao.loadAll method to refresh the cache from the DB (see the attachment "sprintDao loadAll method running most of the time.png").

Sprint-related JQL functions (FutureSprints(), ClosedSprints(), OpenSprints()) use the getAllSprints method and thus need sprintCache. They do not trigger a flush, but they also wait for the cache to be available. They extend AbstractSprintsStateJqlFunction and, because the cache does not track Sprint state, they resort to the getAllSprints method.

Thread dumps show threads waiting for the cache to be available (see the attachment "getAllSprints threads waiting for cache refresh to finish.png").
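
To illustrate why the state-based JQL functions depend on the full cache, here is a minimal sketch of filtering by state over a list of all sprints. The SprintStateFilterSketch class, the SprintRow type and the State enum are hypothetical and do not reflect the real AbstractSprintsStateJqlFunction. The sketch only shows that, without a per-state index, the cost is proportional to the total number of sprints, not to the number that match.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of a state filter over "all sprints"; not Jira code. */
class SprintStateFilterSketch {
    // Assumed state names, chosen for this sketch only.
    enum State { FUTURE, ACTIVE, CLOSED }

    static final class SprintRow {
        final long id;
        final State state;

        SprintRow(long id, State state) {
            this.id = id;
            this.state = state;
        }
    }

    /** Visits every sprint once, even if only a handful are in the wanted state. */
    static List<Long> idsInState(List<SprintRow> allSprints, State wanted) {
        List<Long> ids = new ArrayList<>();
        for (SprintRow sprint : allSprints) {
            if (sprint.state == wanted) {
                ids.add(sprint.id);
            }
        }
        return ids;
    }
}
```

In this model, a ClosedSprints()-style lookup would call idsInState(allSprints, State.CLOSED), where allSprints is whatever the full cache returns. If that list is being reloaded, the filter cannot start until the reload completes.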

This scenario may lead to thread exhaustion with several threads waiting for the cache to be cleared.
The cache is constantly being refreshed on each node and each will be doing a full table scan on AO_60DB71_SPRINT.
Users may face timeouts or long delays on their JQL search or during their Sprint activities.
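
As a rough, back-of-the-envelope illustration of the load described above, the sketch below computes an upper bound on rows read from AO_60DB71_SPRINT. It assumes that every flush is followed by a full reload on every node before the next flush arrives; real numbers may be lower. The class and method names, the node count and the number of operations are assumptions made for this example only.

```java
/** Hypothetical worst-case estimate of rows read due to cache refreshes; not Jira code. */
class ClusterRefreshCostSketch {
    /**
     * Upper bound: each flush makes each node rescan the whole table once.
     * Math.multiplyExact throws ArithmeticException on overflow instead of wrapping.
     */
    static long worstCaseRowsRead(int nodes, long sprintRows, long flushes) {
        return Math.multiplyExact(Math.multiplyExact((long) nodes, sprintRows), flushes);
    }
}
```

For example, with an assumed 4 nodes, the 90,000 sprints from the reproduction steps and 10 sprint operations, worstCaseRowsRead(4, 90_000L, 10L) returns 3,600,000 rows read for just ten user actions.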

Workaround

While there's currently no way to disable the cache or to stop these activities from triggering a sprintCache flush, it's possible to make cache refreshes faster by keeping the number of Sprints lower. Current guardrails suggest a maximum of 60,000 sprints.

Customers may also use Auto-managed Sprints (introduced in Jira 9.8.0) to reduce the number of Sprints being started or closed at the same time. This feature requires Parallel Sprints to be enabled.

Attachments

- getAllSprints threads waiting for cache refresh to finish.png (7.00 MB, 19/Oct/2023 8:17 PM)
- sprintDao loadAll method running most of the time.png (6.48 MB, 19/Oct/2023 8:18 PM)

Issue Links

relates to: ADOBE-479; PSR-878

resolves: ACE-4499

Sprint cache invalidation may block other regular operations leading to performance degradation of the Jira Software Data Center instance
