Jira Service Management Data Center
JSDSERVER-5730

OffThreadEventJobRunner job execution threads wait for their turn in a very expensive way


Details

    Description

      Atlassian Update – 10 September 2021

      Hi everyone,

      We have reopened this bug after investigating a related issue documented in JSDSERVER-5736.

      • This issue is related to threads deadlocking when there are frequent actions on one request. A workaround was released behind a dark feature in 4.9.0.
      • The development team will be working on a permanent fix shortly.

      For details on the workaround, please see the related documentation - 

      We hope to resolve this issue as soon as possible!

      If you have further concerns with the above, please open a support ticket via https://support.atlassian.com

      Thank you,

      Alex

       

      The system is expected to execute each job on the node that submitted it. When the submitting thread is a request thread, the system generates a unique identifier for each submission and enqueues it on a PSMQ for the issue to which the given event pertains; the OffThreadEventJobRunner then uses an unbounded ThreadPoolExecutor (JSDSERVER-5732) to spawn a thread that carries that unique identifier.
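      A minimal sketch of that submission path, in plain Java. All names here (SubmissionSketch, submit, claimId, and the in-memory queue map) are hypothetical stand-ins, not the real JSD internals; a cached thread pool is used to model the unbounded executor, since it likewise spawns a new thread whenever none is idle.

      ```java
      import java.util.Map;
      import java.util.Queue;
      import java.util.UUID;
      import java.util.concurrent.ConcurrentHashMap;
      import java.util.concurrent.ConcurrentLinkedQueue;
      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;
      import java.util.concurrent.TimeUnit;

      // Hypothetical sketch; not the real JSD classes.
      public class SubmissionSketch {
          // One FIFO queue per issue, standing in for the per-issue PSMQ queue.
          static final Map<Long, Queue<String>> queuesByIssue = new ConcurrentHashMap<>();
          // Stands in for the unbounded ThreadPoolExecutor (JSDSERVER-5732).
          static final ExecutorService pool = Executors.newCachedThreadPool();

          static String submit(long issueId, Runnable work) {
              String claimId = UUID.randomUUID().toString();        // unique identifier per submission
              queuesByIssue
                  .computeIfAbsent(issueId, k -> new ConcurrentLinkedQueue<>())
                  .add(claimId);                                    // enqueue on the issue's queue
              pool.submit(() -> {
                  // The job execution thread carries claimId; it must wait until
                  // claimId reaches the head of the issue's queue before running work.
                  work.run();
              });
              return claimId;
          }

          public static void main(String[] args) throws InterruptedException {
              submit(101L, () -> System.out.println("job for issue 101 ran"));
              pool.shutdown();
              pool.awaitTermination(5, TimeUnit.SECONDS);
          }
      }
      ```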

      Each job execution thread locks the queue on which it expects to see its unique identifier (UPDATE QUEUE), locks the message at the head of the queue expecting that the message contains its unique identifier (BEGIN + UPDATE QUEUE + SELECT MESSAGE + UPDATE MESSAGE + UPDATE QUEUE + COMMIT), dequeues the message (BEGIN + UPDATE QUEUE + DELETE MESSAGE + UPDATE QUEUE + COMMIT), and finally executes the associated work.
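      One claim attempt by an execution thread can be sketched as follows, with a ConcurrentLinkedQueue standing in for the PSMQ table and the BEGIN/UPDATE/SELECT/DELETE/COMMIT statement sequences above collapsed into peek and remove. The names are hypothetical, and the non-atomic peek-then-remove deliberately mirrors one of the failure modes discussed below.

      ```java
      import java.util.Queue;
      import java.util.concurrent.ConcurrentLinkedQueue;

      // Hypothetical sketch of one claim attempt by a job execution thread.
      public class ClaimAttempt {
          /** Returns true if the thread's message was at the head and the work ran. */
          static boolean tryClaimAndRun(Queue<String> queue, String claimId, Runnable work) {
              String head = queue.peek();            // inspect the message at the head
              if (!claimId.equals(head)) {
                  return false;                      // head is missing or belongs to another thread
              }
              queue.remove(head);                    // dequeue the message
              work.run();                            // execute the associated work
              return true;
          }

          public static void main(String[] args) {
              Queue<String> q = new ConcurrentLinkedQueue<>();
              q.add("first");
              q.add("second");
              System.out.println(tryClaimAndRun(q, "second", () -> {}));  // false: not at head yet
              System.out.println(tryClaimAndRun(q, "first", () -> {}));   // true: claimed and ran
          }
      }
      ```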

      There are many points at which this process can fail under contention. While most of the failure modes that PSMQ handles with immediate retries up to 10000 ms should be unlikely (fewer messages than readers; next message has expired; message dequeued by other reader between peek and dequeue), there are at least two failure modes that we handle by repeating the process after sleeping for 5 ms (queue locked by other reader; next message doesn’t contain thread’s unique identifier).
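      The sleep-and-repeat handling amounts to polling. A sketch of the resulting loop, again with hypothetical names and the queue modeled in memory rather than via SQL:

      ```java
      import java.util.Queue;
      import java.util.concurrent.ConcurrentLinkedQueue;

      // Hypothetical sketch of the busy-wait: on the two contended failure modes
      // (queue locked by another reader, or the head message isn't ours) the thread
      // sleeps 5 ms and repeats the whole claim sequence from the top.
      public class SleepRetryLoop {
          static int runWhenAtHead(Queue<String> queue, String claimId, Runnable work)
                  throws InterruptedException {
              int attempts = 0;
              while (true) {
                  attempts++;
                  if (claimId.equals(queue.peek())) {  // our message reached the head
                      queue.remove(claimId);
                      work.run();
                      return attempts;                 // number of claim attempts made
                  }
                  Thread.sleep(5);                     // back off, then retry everything
              }
          }

          public static void main(String[] args) throws InterruptedException {
              Queue<String> q = new ConcurrentLinkedQueue<>();
              q.add("other");
              q.add("mine");
              // Simulate the competing thread finishing its job after ~20 ms.
              new Thread(() -> {
                  try { Thread.sleep(20); } catch (InterruptedException ignored) {}
                  q.remove("other");
              }).start();
              int attempts = runWhenAtHead(q, "mine", () -> System.out.println("work ran"));
              System.out.println("claim attempts: " + attempts);  // several, due to polling
          }
      }
      ```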

      This approach is obviously inefficient, but I can see how it might be an appropriate choice under an assumption like “contention for a given queue should only occur rarely and randomly”. That assumption doesn’t seem to hold: there are reports that some bulk issue operations and ScriptRunner usage patterns can easily result in contention, and our initial investigation found evidence that actions as simple as posting a comment almost always result in contention.

      In the case of posting a comment, the contention occurs because our EventListenerLauncher appears to submit one SLA cycle updater job for each of the two commit-wrapped events that are dispatched by JIRA. The usual outcome is that we spawn two job execution threads in quick succession that contend for the same queue, and one of the threads issues at least ten unsuccessful UPDATE QUEUE queries under ideal conditions (no other activity, local database, empty instance, task management project, and the thread for the first job wins the race).
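      That scenario can be reproduced in miniature: two submissions land on the same issue's queue back to back, and the losing thread polls every 5 ms until the winner finishes its work and dequeues. This is a hypothetical in-memory model, not the real SQL path; the winner always records zero failed polls, while the loser records several, mirroring the repeated unsuccessful UPDATE QUEUE queries.

      ```java
      import java.util.Queue;
      import java.util.concurrent.ConcurrentLinkedQueue;
      import java.util.concurrent.CountDownLatch;

      // Hypothetical model of posting a comment: two commit-wrapped events yield
      // two SLA cycle updater jobs for the same issue, so two execution threads
      // race for one queue and the loser polls every 5 ms until the winner is done.
      public class CommentContention {
          /** Returns the number of failed polls for each job; index 0 is job-1. */
          static int[] runTwoJobs(long workMillis) throws InterruptedException {
              Queue<String> issueQueue = new ConcurrentLinkedQueue<>();
              String[] jobs = {"job-1", "job-2"};
              for (String j : jobs) issueQueue.add(j);
              int[] polls = new int[2];
              CountDownLatch done = new CountDownLatch(2);
              for (int i = 0; i < 2; i++) {
                  final int idx = i;
                  new Thread(() -> {
                      while (!jobs[idx].equals(issueQueue.peek())) {
                          polls[idx]++;                     // an unsuccessful claim attempt
                          sleepQuietly(5);
                      }
                      sleepQuietly(workMillis);             // the actual SLA update "work"
                      issueQueue.remove(jobs[idx]);
                      done.countDown();
                  }).start();
              }
              done.await();
              return polls;
          }

          static void sleepQuietly(long ms) {
              try { Thread.sleep(ms); } catch (InterruptedException ignored) {}
          }

          public static void main(String[] args) throws InterruptedException {
              int[] polls = runTwoJobs(30);
              System.out.println("job-1 failed polls: " + polls[0]);
              System.out.println("job-2 failed polls: " + polls[1]);
          }
      }
      ```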


      People

        esantos2 Elton Santos
        dazabani Delan Azabani (Inactive)
        Votes: 38
        Watchers: 53
