- Bug
- Resolution: Timed out
- Low
- None
- 7.1.2
- 7.01
- 7
- Severity 2 - Major
- 3
On our test JIRA instance (JIRA 7.1.2), we have hit a case where every administration page fails to render, while all non-administrative pages render normally. The cause is that the request-serving threads end up in a deadlock.
Example stack trace:
"http-nio-8080-exec-24" #111 daemon prio=5 os_prio=0 tid=0x00007efd3186b000 nid=0x132 waiting on condition [0x00007efd1802f000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00000000dc306748> (a java.util.concurrent.locks.ReentrantLock$NonfairSync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199) at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209) at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285) at com.atlassian.beehive.simple.SingleNodeLock.lock(SingleNodeLock.java:60) at com.atlassian.jira.config.DefaultReindexMessageManager.getMessage(DefaultReindexMessageManager.java:170) at jsp.decorators.admin_jsp._jspService(admin_jsp.java:315) at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70) at javax.servlet.http.HttpServlet.service(HttpServlet.java:725) ...
Every thread that serves a request for an administrative page ends up parked on the same lock and makes no progress.
That lock is held by the following thread, which is itself blocked:
"http-nio-8080-exec-14" #99 daemon prio=5 os_prio=0 tid=0x00007efd30ed2800 nid=0x126 waiting on condition [0x00007efd19057000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00000000d2700198> (a java.util.concurrent.locks.ReentrantLock$NonfairSync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199) at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209) at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285) at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2218) at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197) at com.google.common.cache.LocalCache.get(LocalCache.java:3937) at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4739) at com.atlassian.cache.memory.DelegatingCache.get(DelegatingCache.java:173) at com.atlassian.cache.memory.DelegatingCache.get(DelegatingCache.java:158) at com.atlassian.jira.propertyset.CachingOfBizPropertyEntryStore.resolve(CachingOfBizPropertyEntryStore.java:111) at com.atlassian.jira.propertyset.CachingOfBizPropertyEntryStore.getType(CachingOfBizPropertyEntryStore.java:141) at com.atlassian.jira.propertyset.CachingOfBizPropertySet.exists(CachingOfBizPropertySet.java:139) at com.atlassian.jira.config.DefaultReindexMessageManager.getMessage(DefaultReindexMessageManager.java:175) at jsp.decorators.admin_jsp._jspService(admin_jsp.java:315) at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70) at javax.servlet.http.HttpServlet.service(HttpServlet.java:725) ...
It is not apparent from the dump which thread holds the Guava cache segment lock. However, another thread is blocked on the same segment lock:
"Caesium-1-3" #148 daemon prio=5 os_prio=0 tid=0x00007efd39918000 nid=0x156 waiting on condition [0x00007efd2aa16000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00000000d2700198> (a java.util.concurrent.locks.ReentrantLock$NonfairSync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199) at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209) at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285) at com.google.common.cache.LocalCache$Segment.remove(LocalCache.java:3067) at com.google.common.cache.LocalCache.remove(LocalCache.java:4173) at com.google.common.cache.LocalCache$LocalManualCache.invalidate(LocalCache.java:4765) at com.atlassian.cache.memory.DelegatingCache.remove(DelegatingCache.java:237) at com.atlassian.jira.propertyset.CachingOfBizPropertyEntryStore.invalidateCacheEntry(CachingOfBizPropertyEntryStore.java:239) at com.atlassian.jira.propertyset.CachingOfBizPropertyEntryStore.removeEntry(CachingOfBizPropertyEntryStore.java:160) at com.atlassian.jira.propertyset.CachingOfBizPropertySet.remove(CachingOfBizPropertySet.java:165) at com.atlassian.sal.jira.pluginsettings.JiraPluginSettings.putActual(JiraPluginSettings.java:24) at com.atlassian.sal.core.pluginsettings.AbstractStringPluginSettings.put(AbstractStringPluginSettings.java:97) at com.atlassian.upm.core.impl.NamespacedPluginSettings.put(NamespacedPluginSettings.java:34) at com.atlassian.upm.notification.PluginSettingsNotificationCache.saveNotifications(PluginSettingsNotificationCache.java:325) ...
Note that this thread holds the OneShotLatch acquired in DelegatingCache.acquireLockFor(), and yet another thread is waiting for that latch:
"Caesium-1-2" #147 daemon prio=5 os_prio=0 tid=0x00007efd38a64000 nid=0x155 waiting on condition [0x00007efd2a59d000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00000000e7747500> (a com.atlassian.cache.impl.OneShotLatch$Sync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283) at com.atlassian.cache.impl.OneShotLatch.await(OneShotLatch.java:31) at com.atlassian.cache.memory.DelegatingCache.acquireLockFor(DelegatingCache.java:455) at com.atlassian.cache.memory.DelegatingCache.remove(DelegatingCache.java:234) at com.atlassian.jira.propertyset.CachingOfBizPropertyEntryStore.invalidateCacheEntry(CachingOfBizPropertyEntryStore.java:239) at com.atlassian.jira.propertyset.CachingOfBizPropertyEntryStore.removeEntry(CachingOfBizPropertyEntryStore.java:160) at com.atlassian.jira.propertyset.CachingOfBizPropertySet.remove(CachingOfBizPropertySet.java:165) at com.atlassian.sal.jira.pluginsettings.JiraPluginSettings.putActual(JiraPluginSettings.java:24) at com.atlassian.sal.core.pluginsettings.AbstractStringPluginSettings.put(AbstractStringPluginSettings.java:85) at com.atlassian.analytics.client.upload.S3EventUploader.storeLastUploadDate(S3EventUploader.java:91) at com.atlassian.analytics.client.upload.S3EventUploader.execute(S3EventUploader.java:84) at com.atlassian.scheduler.compat.local.LocalOnlyPluginJob.execute(LocalOnlyPluginJob.java:30) ...
I don't have the source code for JIRA 7.1.2 at hand, but looking at the sources of atlassian-cache-memory-2.11.1 (shipped with JIRA 7.1.9) I did notice a potential deadlock: the two code paths acquire the same pair of locks in opposite order (a minimal model follows the list below):
DelegatingCache.remove():
- acquireLockFor(key)
- internalCache.invalidate(), which acquires the Guava segment lock internally

DelegatingCache.get():
- internalCache.get(), which acquires the Guava segment lock internally
- acquireLockFor(key)
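To make the inverted ordering concrete, here is a minimal self-contained model of the two paths. It is not the actual DelegatingCache code; keyLock and segmentLock are placeholder names standing in for the lock taken by acquireLockFor(key) and for the Guava segment lock.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

// Simplified model of the inverted acquisition order described above.
public class LockOrderInversion {
    static final ReentrantLock keyLock = new ReentrantLock();
    static final ReentrantLock segmentLock = new ReentrantLock();
    static final CountDownLatch bothStarted = new CountDownLatch(2);

    public static void main(String[] args) {
        // Models the remove() path: key lock first, then segment lock.
        Thread remover = new Thread(() -> {
            keyLock.lock();
            try {
                ready();
                segmentLock.lock();   // parks forever once the getter holds the segment lock
                segmentLock.unlock();
            } finally {
                keyLock.unlock();
            }
        }, "remove-path");

        // Models the get() path: segment lock first, then key lock.
        Thread getter = new Thread(() -> {
            segmentLock.lock();
            try {
                ready();
                keyLock.lock();       // parks forever once the remover holds the key lock
                keyLock.unlock();
            } finally {
                segmentLock.unlock();
            }
        }, "get-path");

        remover.start();
        getter.start();
        // Both threads now park on the lock the other one holds, which is the same
        // WAITING (parking) state seen in the thread dumps above.
    }

    // Make the deadlock deterministic: wait until both threads hold their first lock.
    private static void ready() {
        bothStarted.countDown();
        try {
            bothStarted.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

The usual remedy for this class of problem is to acquire the two locks in the same agreed order on both paths, but I cannot say how that maps onto the actual DelegatingCache implementation.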
This inverted ordering may well be the same problem seen in the dumps above, but it needs a more thorough investigation to confirm.