The expectation system executes each job on the node that submitted it. When the submitting thread is a request thread, the system generates a unique identifier for the submission and enqueues it on a PSMQ for the issue to which the event pertains. The OffThreadEventJobRunner then uses an unbounded ThreadPoolExecutor (JSDSERVER-5732) to spawn a thread that is equipped with that unique identifier.
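To make the submission path concrete, here is a minimal sketch of it. It is not the product code: `SubmissionSketch`, `submit`, and `pending` are hypothetical names, an in-memory map of FIFO queues stands in for the per-issue PSMQ queues, and `Executors.newCachedThreadPool()` stands in for the unbounded ThreadPoolExecutor.

```java
import java.util.Map;
import java.util.Queue;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

// Hypothetical stand-in for the submission path; none of these names come from the product.
final class SubmissionSketch {
    // One in-memory FIFO per issue (assumption: models the per-issue PSMQ queue).
    private final Map<Long, Queue<String>> queuesByIssue = new ConcurrentHashMap<>();

    // A cached thread pool is a ThreadPoolExecutor whose maximum pool size is
    // Integer.MAX_VALUE, so every submission can get its own thread.
    private final ExecutorService executor = Executors.newCachedThreadPool();

    // Enqueue a fresh identifier for the issue, then start a worker that carries it.
    CompletableFuture<Void> submit(long issueId, Consumer<String> worker) {
        String jobId = UUID.randomUUID().toString();
        queuesByIssue
                .computeIfAbsent(issueId, id -> new ConcurrentLinkedQueue<>())
                .add(jobId);
        return CompletableFuture.runAsync(() -> worker.accept(jobId), executor);
    }

    // Number of identifiers still waiting on the issue's queue.
    int pending(long issueId) {
        Queue<String> queue = queuesByIssue.get(issueId);
        return queue == null ? 0 : queue.size();
    }

    void shutdown() {
        executor.shutdown();
    }
}
```

Note that the sketch only covers submission: the identifier stays on the queue until the worker that carries it finds and dequeues it, which is where the contention discussed in this note arises.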
Each job execution thread locks the queue on which it expects to see its unique identifier (UPDATE QUEUE), locks the message at the head of the queue expecting that the message contains its unique identifier (BEGIN + UPDATE QUEUE + SELECT MESSAGE + UPDATE MESSAGE + UPDATE QUEUE + COMMIT), dequeues the message (BEGIN + UPDATE QUEUE + DELETE MESSAGE + UPDATE QUEUE + COMMIT), and finally executes the associated work.
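The reader-side sequence above can be modelled in memory as follows. This is a sketch under stated assumptions rather than the actual PSMQ implementation: `ReaderSequenceSketch` and its methods are hypothetical, each step replaces a database transaction with a synchronized method, and the point at which the queue is released (here, after the work finishes) is an assumption.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical in-memory model of one reader's pass over its queue.
final class ReaderSequenceSketch {
    enum Outcome { EXECUTED, QUEUE_LOCKED, WRONG_HEAD }

    private final Deque<String> messages = new ArrayDeque<>();
    private String owner; // identifier of the reader holding the queue, or null

    synchronized void enqueue(String jobId) {
        messages.addLast(jobId);
    }

    // Step 1 (UPDATE QUEUE): claim the queue unless another reader holds it.
    synchronized boolean claim(String jobId) {
        if (owner != null && !owner.equals(jobId)) {
            return false;
        }
        owner = jobId;
        return true;
    }

    synchronized void release(String jobId) {
        if (jobId.equals(owner)) {
            owner = null;
        }
    }

    // Step 2: check that the head message carries this reader's identifier.
    synchronized boolean headIs(String jobId) {
        return jobId.equals(messages.peekFirst());
    }

    // Step 3 (DELETE MESSAGE): remove the head message.
    synchronized void dequeueHead() {
        messages.removeFirst();
    }

    // One pass through the whole sequence for the reader that owns jobId.
    Outcome attempt(String jobId, Runnable work) {
        if (!claim(jobId)) {
            return Outcome.QUEUE_LOCKED;
        }
        try {
            if (!headIs(jobId)) {
                return Outcome.WRONG_HEAD;
            }
            dequeueHead();
            work.run(); // Step 4: execute the associated work
            return Outcome.EXECUTED;
        } finally {
            release(jobId); // assumption: the queue is released only here
        }
    }
}
```

Even in this simplified model, a reader whose identifier is not at the head of the queue, or whose queue is held by another reader, has no option other than to come back later.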
There are many points at which this process can fail under contention. Most of the failure modes that PSMQ handles with immediate retries for up to 10000 ms should be unlikely (fewer messages than readers; next message has expired; message dequeued by another reader between peek and dequeue). However, there are at least two failure modes that we handle by sleeping for 5 ms and then repeating the whole process (queue locked by another reader; next message doesn’t contain the thread’s unique identifier).
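The sleep-and-repeat handling of those two failure modes amounts to a polling loop, sketched below. The names (`SleepRetrySketch`, `runUntilSuccess`) are hypothetical, the sleeper is injected so the loop can be exercised without real delays, and the absence of an upper bound on attempts is an assumption; I have not confirmed whether the real loop gives up at some point.

```java
import java.util.function.LongConsumer;
import java.util.function.Supplier;

// Hypothetical outline of the sleep-and-repeat handling.
final class SleepRetrySketch {
    enum Attempt { SUCCESS, QUEUE_LOCKED, WRONG_HEAD }

    static final long SLEEP_MS = 5; // sleep between passes, as described in the text

    // Repeats the attempt until it succeeds, sleeping after every failure.
    // Returns the number of failed attempts. Assumption: no attempt limit.
    static int runUntilSuccess(Supplier<Attempt> attempt, LongConsumer sleeper) {
        int failures = 0;
        while (attempt.get() != Attempt.SUCCESS) {
            failures++;
            sleeper.accept(SLEEP_MS);
        }
        return failures;
    }
}
```

Every failed pass costs at least one query against the queue plus a 5 ms sleep, so the cost of this loop grows linearly with the time the reader has to wait for its turn.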
This approach is obviously inefficient, but I can see how it might be an appropriate choice under an assumption like “contention for a given queue should only occur rarely and randomly”. At the very least, that assumption doesn’t seem to hold: there are reports that some bulk issue operations and ScriptRunner usage patterns can easily result in contention, and our initial investigation found evidence that actions as simple as posting a comment almost always result in contention.
In the case of posting a comment, the contention occurs because our EventListenerLauncher appears to submit one SLA cycle updater job for each of the two commit-wrapped events that are dispatched by JIRA. The usual outcome is that we spawn two job execution threads in quick succession that contend for the same queue, and one of the threads issues at least ten unsuccessful UPDATE QUEUE queries under ideal conditions (no other activity, local database, empty instance, task management project, and the thread for the first job wins the race).