Jira Data Center / JRASERVER-14220

Ensure the index optimize operation does not cause index lock timeouts


      In jira-application.properties, we say that JIRA should wait at most 30s for an index lock before giving up:

      # Specified the 'wait time' for a file lock in the Lucene IssueIndexManager (in milliseconds)
      # This value should only be modified if you are seeing a jira.issue.index.DefaultIndexManager 'Giving up reindex' ERROR
      # in your log files or requested to do so by Atlassian support.
      jira.index.lock.waittime=30000
      

      For an average-size JIRA instance (33k issues) on average hardware, JIRA's nightly index optimization takes 48s. On the ASF JIRA (75k issues), it takes 208s. If any write operation takes place during this time, it will fail.

      Clearly, 30s hangs are not outside the bounds of possibility. Could we increase this to 300s or something?
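      To make the failure mode concrete, here is a minimal sketch of how a bounded lock wait turns a long optimize into failed writes. This is not Jira's actual code: the class and method names are hypothetical, and a plain ReentrantLock stands in for the Lucene index lock.

      import java.util.concurrent.TimeUnit;
      import java.util.concurrent.locks.ReentrantLock;

      // Illustrative sketch only: models the effect of jira.index.lock.waittime.
      public class IndexLockExample {
          private final ReentrantLock indexLock = new ReentrantLock();
          private static final long LOCK_WAIT_MS = 30000; // jira.index.lock.waittime

          public void reindexIssue(Runnable writeOperation) throws Exception {
              // If a nightly optimize holds the lock for 48-208s, this wait expires
              // and the write operation fails, as described above.
              if (!indexLock.tryLock(LOCK_WAIT_MS, TimeUnit.MILLISECONDS)) {
                  throw new Exception("Giving up reindex: could not get index lock in "
                          + LOCK_WAIT_MS + " ms");
              }
              try {
                  writeOperation.run();
              } finally {
                  indexLock.unlock();
              }
          }
      }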


            G B added a comment -

            I believe we are experiencing this problem with Jira 3.13 (our index optimizations fail due to timeout). I'm not entirely sure I completely understand what the issue is here, though. Is there documentation somewhere, or could someone explain? What is the right thing to do to deal with this in Jira 3.13 before Jira 4.0 is released? Should we just increase jira.index.lock.waittime, should we raise jira.index.max.reindexes to attempt to force the reindex to happen only during periods of low load (difficult because we have a global user base), or something else?


            Jed Wesley-Smith (Inactive) added a comment -

            Index updates are now put in a queue. The client may time out waiting for the queue to be processed, but the operation remains on the queue.
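            As a rough illustration of that behaviour (class and method names are illustrative, not Jira's actual code): the client waits on the queued operation with a timeout, but a timed-out wait does not remove the operation from the queue.

            import java.util.concurrent.ExecutionException;
            import java.util.concurrent.ExecutorService;
            import java.util.concurrent.Executors;
            import java.util.concurrent.Future;
            import java.util.concurrent.TimeUnit;
            import java.util.concurrent.TimeoutException;

            public class QueuedIndexUpdates {
                // A single writer thread drains the queue of index operations.
                private final ExecutorService indexQueue = Executors.newSingleThreadExecutor();

                public void updateIndex(Runnable indexOperation, long waitMs) {
                    Future<?> result = indexQueue.submit(indexOperation);
                    try {
                        result.get(waitMs, TimeUnit.MILLISECONDS);
                    } catch (TimeoutException e) {
                        // The client gives up waiting, but the operation is NOT cancelled;
                        // it stays on the queue and is still processed later.
                    } catch (InterruptedException | ExecutionException e) {
                        throw new RuntimeException(e);
                    }
                }
            }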

            Jeff Turner added a comment -

            If we can distinguish a lock held by an optimize operation, great. We can whack a huge timeout on it, and leave a 30s timeout for regular operations.

            AntonA added a comment -

            That might be an idea. I guess if we know that the lock is held by the optimise operation then there could be very few things wrong. Thread dumps scare people when they appear in the logs, so I guess I want to print one only when there is a problem.

            Maybe we should just have a different, much, much larger timeout for the lock, when the lock is held by the optimize operation. What do you think?

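            A minimal sketch of the two-timeout idea from the last two comments, assuming a simple flag marks when the optimise operation holds the lock. All names are hypothetical, and a ReentrantLock stands in for the index lock.

            import java.util.concurrent.TimeUnit;
            import java.util.concurrent.locks.ReentrantLock;

            public class OptimizeAwareLocking {
                private final ReentrantLock indexLock = new ReentrantLock();
                private volatile boolean optimizeInProgress = false;

                private static final long NORMAL_WAIT_MS = 30000;    // regular operations
                private static final long OPTIMIZE_WAIT_MS = 600000; // lock held by optimise

                public void withIndexLock(Runnable operation) throws Exception {
                    // Wait much longer when the optimise operation is known to hold the lock.
                    long waitMs = optimizeInProgress ? OPTIMIZE_WAIT_MS : NORMAL_WAIT_MS;
                    if (!indexLock.tryLock(waitMs, TimeUnit.MILLISECONDS)) {
                        throw new Exception("Could not acquire index lock within " + waitMs + " ms");
                    }
                    try {
                        operation.run();
                    } finally {
                        indexLock.unlock();
                    }
                }

                public void optimize(final Runnable optimizeOperation) throws Exception {
                    withIndexLock(new Runnable() {
                        public void run() {
                            optimizeInProgress = true;
                            try {
                                optimizeOperation.run();
                            } finally {
                                optimizeInProgress = false;
                            }
                        }
                    });
                }
            }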

            Jeff Turner added a comment -

            I think the thread dump after 30s is definitely useful and worth keeping. Sometimes lock contention is normal (e.g. during optimizations), but often it is a sign that things are broken.

            How about generating the thread dump after 30 seconds, but not throwing an IndexException? Just try again to acquire the lock, this time waiting indefinitely.
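            A rough sketch of that suggestion, using a ReentrantLock in place of the index lock and the JDK ThreadMXBean for the thread dump; illustrative only, not Jira's implementation.

            import java.lang.management.ManagementFactory;
            import java.lang.management.ThreadInfo;
            import java.util.concurrent.TimeUnit;
            import java.util.concurrent.locks.ReentrantLock;

            public class DumpThenWaitLocking {
                private final ReentrantLock indexLock = new ReentrantLock();
                private static final long FIRST_WAIT_MS = 30000;

                public void withIndexLock(Runnable operation) throws InterruptedException {
                    if (!indexLock.tryLock(FIRST_WAIT_MS, TimeUnit.MILLISECONDS)) {
                        // Contention beyond 30s is suspicious: capture diagnostics...
                        logThreadDump();
                        // ...but do not fail the write; keep waiting for the lock.
                        indexLock.lockInterruptibly();
                    }
                    try {
                        operation.run();
                    } finally {
                        indexLock.unlock();
                    }
                }

                private void logThreadDump() {
                    for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(true, true)) {
                        System.err.print(info);
                    }
                }
            }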

            AntonA added a comment -

            It will block them; optimise cannot occur on a Lucene index concurrently with another write. However, the idea is to make them not time out (but rather wait indefinitely).


            Jeff Turner added a comment -

            If we can have optimising not block regular operations (e.g. adding comments), great. 75k issues (640Mb of Lucene indexes) is not unusual, and 208s per day is quite a bit.

            AntonA added a comment -

            I think we need to spend a lot more time to see why optimise takes so long. With a value of 300s, I am afraid that it will mask too many other problems during other operations (i.e. not optimise).

            I think we need to change the behaviour such that optimise does not cause timeouts.

            What do you think?


              Assignee: Unassigned
              Reporter: Jeff Turner
              Votes: 0
              Watchers: 3