  1. Jira Software Data Center
  2. JSWSERVER-25340

Sprint cache invalidation may block other regular operations leading to performance degradation of the Jira Software Data Center instance


Details

    Description

      Issue Summary

      This is reproducible on Data Center: yes

      During periods of high activity related to Sprints, such as create, close, open and reopen, these operations wait for each other to finish because they all depend on the same cache (sprintCache). They also trigger a flush of that cache, which invalidates it on every node, leading to a scenario where the cache is constantly being refreshed.

      On instances with a large number of Sprints in the AO_60DB71_SPRINT table, each cache refresh takes longer, and this cascades into threads stuck waiting for the refresh to finish. Combined with high activity, this may lead to thread exhaustion and very long delays when performing Sprint activities.

      There are also JQL functions related to Sprint which need to wait for the sprintCache refreshes to finish; these functions are FutureSprints(), ClosedSprints() and OpenSprints().

      Note:

      For reference, two new caches (sprintCacheById and sprintCacheRapidViewIdToSprints) were added as part of JSWSERVER-20618, "Performance of Jira can degrade significantly due to slow sprint cache population". The cache causing the constraint here is the one that existed originally, sprintCache, which contains the entire AO_60DB71_SPRINT table.
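
      To make the cost difference concrete, here is a minimal model of the two caching styles. It is only a sketch: FakeSprintTable, WholeTableCache, ByIdCache and SprintCacheCostDemo are hypothetical names invented for illustration, not Jira classes, and the model simply counts how many rows would be read from the table after a series of sprint updates.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for the AO_60DB71_SPRINT table; counts the rows it hands out. */
class FakeSprintTable {
    final Map<Long, String> rows = new HashMap<>();
    final AtomicLong rowsRead = new AtomicLong();

    FakeSprintTable(int sprintCount) {
        for (long id = 1; id <= sprintCount; id++) {
            rows.put(id, "Sprint " + id);
        }
    }

    /** Full table scan: every row is read. */
    List<String> loadAll() {
        rowsRead.addAndGet(rows.size());
        return new ArrayList<>(rows.values());
    }

    /** Single-row lookup. */
    String loadById(long id) {
        rowsRead.incrementAndGet();
        return rows.get(id);
    }
}

/** Models a cache holding the whole table: a flush forces a full reload. */
class WholeTableCache {
    private final FakeSprintTable table;
    private List<String> all; // null means "flushed"

    WholeTableCache(FakeSprintTable table) { this.table = table; }

    List<String> getAll() {
        if (all == null) {
            all = table.loadAll();
        }
        return all;
    }

    void flushAll() { all = null; }
}

/** Models a keyed cache: only the touched entry is reloaded. */
class ByIdCache {
    private final FakeSprintTable table;
    private final Map<Long, String> byId = new HashMap<>();

    ByIdCache(FakeSprintTable table) { this.table = table; }

    String get(long id) { return byId.computeIfAbsent(id, table::loadById); }

    void remove(long id) { byId.remove(id); }
}

class SprintCacheCostDemo {
    /** Rows read when every update flushes and then repopulates the whole-table cache. */
    static long rowsReadWholeTable(int sprints, int updates) {
        FakeSprintTable table = new FakeSprintTable(sprints);
        WholeTableCache cache = new WholeTableCache(table);
        cache.getAll(); // initial population
        for (int i = 0; i < updates; i++) {
            cache.flushAll();
            cache.getAll();
        }
        return table.rowsRead.get();
    }

    /** Rows read when every update only evicts and reloads the affected sprint. */
    static long rowsReadById(int sprints, int updates) {
        FakeSprintTable table = new FakeSprintTable(sprints);
        ByIdCache cache = new ByIdCache(table);
        for (long id = 1; id <= updates; id++) {
            cache.remove(id);
            cache.get(id);
        }
        return table.rowsRead.get();
    }

    public static void main(String[] args) {
        System.out.println("whole-table: " + rowsReadWholeTable(1000, 5));
        System.out.println("by id:       " + rowsReadById(1000, 5));
    }
}
```

      In this model the whole-table cache reads a number of rows proportional to the table size for every update, while the keyed cache reads one row per update. The model says nothing about how the real sprintCacheById is implemented; it only illustrates why a cache that mirrors the entire table is the expensive one to flush.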

      Steps to Reproduce

      The scenario we're trying to reproduce is a very large instance on which users are performing their Sprint planning activities.

      1. Create 90k sprints.
      2. Perform operations that trigger a flush of sprintCache: start several processes performing Sprint operations such as create, start, close or reopen.
      3. At the same time, trigger JQL searches that use sprint-related functions (FutureSprints, ClosedSprints, OpenSprints).
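
      The load in steps 2 and 3 could be generated with a small HTTP client. The sketch below only builds the requests, using the Jira Software "create sprint" endpoint (POST /rest/agile/1.0/sprint) and the Jira search endpoint (GET /rest/api/2/search). The class name SprintLoadRequests, the base URL, the board ID and the authorization header value are assumptions made for illustration; verify the endpoints against the REST documentation for your Jira version before using them.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that builds requests for a reproduction script (Java 11+). */
class SprintLoadRequests {

    /** Builds a "create sprint" request; a sprint creation goes through the sprint cache flush path. */
    static HttpRequest createSprint(String baseUrl, long boardId, String name, String authHeader) {
        // The name is inserted without JSON escaping, so keep it to plain characters.
        String body = "{\"name\":\"" + name + "\",\"originBoardId\":" + boardId + "}";
        return HttpRequest.newBuilder(URI.create(baseUrl + "/rest/agile/1.0/sprint"))
                .header("Content-Type", "application/json")
                .header("Authorization", authHeader)
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    /** Builds a JQL search that uses a sprint function such as openSprints. */
    static HttpRequest searchWithSprintFunction(String baseUrl, String function, String authHeader) {
        String jql = "sprint in " + function + "()";
        String encoded = URLEncoder.encode(jql, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(baseUrl + "/rest/api/2/search?jql=" + encoded))
                .header("Authorization", authHeader)
                .GET()
                .build();
    }
}
```

      A reproduction script would send these requests from several threads at once with java.net.http.HttpClient, for example one pool creating sprints while another repeatedly searches with openSprints, closedSprints and futureSprints.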

      Expected Results

      The operations finish in a reasonable amount of time, since they are fetching from a cache.

      Actual Results

      Each sprint operation that triggers the updateSprint method will trigger a cache flush:

      • This does not include adding or removing issues from a sprint, only create, start, close or reopen.

      Code from Jira 9.1+:

          private ServiceOutcome<Sprint> updateSprint(Sprint sprint, PartialUpdate partialUpdate) {
              final Long sprintId = sprint.getId();
              ServiceOutcome<SprintAO> existing = sprintDao.load(sprintId);
      
              if (!existing.isValid()) {
                  return error(existing);
              }
      
              Sprint existingSprint = sprintAOMapper.toModel(existing.get());
      
              // flush AO cache
              sprintDao.flushAll();
      
              partialUpdate.apply(sprint, existing.getValue());
              sprintDao.save(existing.get());
      
              // reload record to try prevent stale record being in cache
              flushCacheEntriesRelatedToSprint(sprint);
      
              // It is possible that sprint has been moved between boards, in this case we need to invalidate also the previous board's entry.
              if (existingSprint.getRapidViewId() != null && !existingSprint.getRapidViewId().equals(sprint.getRapidViewId())) {
                  rapidViewIdToSprintsCache.remove(existingSprint.getRapidViewId());
              }
      

      When the cache is flushed, the node also sends a message to invalidate sprintCache on the other nodes. While a node is refreshing this cache, whether after an updateSprint or because of an invalidation coming from another node, all similar operations wait for the cache to be available.
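
      The waiting behaviour can be illustrated with a small concurrency model. This is a sketch under stated assumptions, not Jira's implementation: BlockingSprintCache and CacheWaitDemo are hypothetical names, a single lock stands in for whatever synchronisation the real cache uses, and a latch replaces the slow sprintDao.loadAll call so the demo is deterministic.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical cache whose readers must wait while a reload is in progress. */
class BlockingSprintCache {
    private final ReentrantLock lock = new ReentrantLock();
    private List<String> sprints; // null means "flushed"

    // Demo hooks: they let the caller control how long the simulated reload takes.
    final CountDownLatch loadStarted = new CountDownLatch(1);
    final CountDownLatch allowLoadToFinish = new CountDownLatch(1);

    void flushAll() {
        lock.lock();
        try {
            sprints = null;
        } finally {
            lock.unlock();
        }
    }

    List<String> getAllSprints() throws InterruptedException {
        lock.lock();
        try {
            if (sprints == null) {
                loadStarted.countDown();
                allowLoadToFinish.await(); // stands in for a slow full table scan
                sprints = List.of("Sprint 1", "Sprint 2");
            }
            return sprints;
        } finally {
            lock.unlock();
        }
    }
}

class CacheWaitDemo {
    /** Returns true if a second reader was still blocked while the first reload was running. */
    static boolean secondReaderWaitsForReload() {
        BlockingSprintCache cache = new BlockingSprintCache();
        cache.flushAll();
        Callable<List<String>> read = cache::getAllSprints;
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<List<String>> first = pool.submit(read);
            cache.loadStarted.await(); // the first reader now holds the lock and is "loading"
            Future<List<String>> second = pool.submit(read);
            boolean blocked;
            try {
                second.get(200, TimeUnit.MILLISECONDS);
                blocked = false;
            } catch (TimeoutException e) {
                blocked = true; // the second reader could not get past the lock
            }
            cache.allowLoadToFinish.countDown(); // let the reload complete
            first.get(5, TimeUnit.SECONDS);
            second.get(5, TimeUnit.SECONDS);
            return blocked;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("second reader blocked: " + secondReaderWaitsForReload());
    }
}
```

      In the model, every caller that arrives during a reload queues behind it. If flushes arrive faster than reloads complete, the queue never drains, which matches the thread pile-up described in this issue.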

      Thread dumps show that at nearly every snapshot there was one thread running the sprintDao.loadAll method to refresh the cache from the database.

       

      Sprint-related JQL functions (FutureSprints(), ClosedSprints(), OpenSprints()) use the getAllSprints method and thus need the sprintCache. They do not trigger a flush, but they also wait for the cache to be healthy. They extend AbstractSprintsStateJqlFunction and, because the other caches are not keyed by Sprint state, they resort to the getAllSprints method.
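
      A simplified view of why a state-based function depends on the full sprint list: with no index by state, the only option is to filter every sprint. The types below (SprintState, SprintRow, SprintStateFilter) are hypothetical and are not the classes used by AbstractSprintsStateJqlFunction.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical sprint states, for illustration only. */
enum SprintState { FUTURE, ACTIVE, CLOSED }

/** Hypothetical minimal sprint record: just an ID and a state. */
class SprintRow {
    final long id;
    final SprintState state;

    SprintRow(long id, SprintState state) {
        this.id = id;
        this.state = state;
    }
}

class SprintStateFilter {
    /** Scans the complete sprint list and keeps the IDs of sprints in the wanted state. */
    static List<Long> idsInState(List<SprintRow> allSprints, SprintState wanted) {
        return allSprints.stream()
                .filter(sprint -> sprint.state == wanted)
                .map(sprint -> sprint.id)
                .collect(Collectors.toList());
    }
}
```

      Because the filter needs the complete list as input, a function shaped like this cannot return until the whole-table cache has been repopulated, even though it never flushes the cache itself.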

      Thread dumps show threads waiting for the cache to be available.

      • This scenario may lead to thread exhaustion with several threads waiting for the cache to be cleared.
      • The cache is constantly being refreshed on each node and each will be doing a full table scan on AO_60DB71_SPRINT.
      • Users may face timeouts or long delays on their JQL search or during their Sprint activities.

      Workaround

      While there is currently no way to disable the cache or to stop these activities from triggering a sprintCache flush, it is possible to make the cache refreshes faster by keeping the number of Sprints lower. Current guardrails suggest a maximum of 60,000 sprints.

      Customers may also use Auto-managed Sprints (introduced in Jira 9.8.0) to reduce the number of Sprints being started or closed at the same time. This feature requires Parallel Sprints to be enabled.

            People

              klopacinski Karol Lopacinski
              a3c7c2a06f95 Filipi Lima