Issue Summary
Jira inefficiently populates Sprint cache (com.atlassian.greenhopper.service.sprint.SprintManagerImpl.sprintCache) due to loading all elements in one go.
The sprint cache population may become slow depending on the number of sprints that exist on the instance. In addition to that, some external factors such as the latency between the JVM and the DB server can also contribute to this behavior.
In some cases 15k rounds trips between the application and the database (see details below) will be required for the entire data set to be retrieved by the thread populating the cache - this translates to a total of 15 seconds of waiting time, if the latency between Jira and the DB server is at 1ms. An increased latency between the app server and the DB server will increase the time it takes for the full results of the query to reach the application exponentially.
Steps to Reproduce
- Set up Jira (8.5.4, 7.6.13)
- Create 150k sprints.
- Complete any sprint (this triggers a sprint cache flush)
- Same fore createSprint, deleteSprint, updateSprint
Expected Results
The sprint cache will be populated again very quickly.
Actual Results
It will take 20+ seconds or even minutes for the sprint cache to be fully populated. During this period, any other threads that require sprint information will be parked, as the sprint cache is not available to them at this time. Once the thread currently populating the sprint cache finishes retrieving the entire data set, other threads will progress normally.
The thread populating the sprint will have a similar stack:
sun.nio.ch.FileDispatcherImpl.read0(Native Method)
sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
sun.nio.ch.IOUtil.read(IOUtil.java:197)
sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
oracle.net.nt.TimeoutSocketChannel.read(TimeoutSocketChannel.java:144)
oracle.net.ns.NIOHeader.readHeaderBuffer(NIOHeader.java:82)
oracle.net.ns.NIOPacket.readFromSocketChannel(NIOPacket.java:139)
oracle.net.ns.NIOPacket.readFromSocketChannel(NIOPacket.java:101)
oracle.net.ns.NIONSDataChannel.readDataFromSocketChannel(NIONSDataChannel.java:80)
oracle.jdbc.driver.T4CMAREngineNIO.prepareForReading(T4CMAREngineNIO.java:98)
oracle.jdbc.driver.T4CMAREngineNIO.unmarshalUB1(T4CMAREngineNIO.java:534)
oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:485)
oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:252)
oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:612)
oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:226)
...
com.sun.proxy.$Proxy3126.stream(Unknown Source)
com.atlassian.greenhopper.service.sprint.SprintDao.loadAll(SprintDao.java:19)
com.atlassian.greenhopper.service.sprint.SprintManagerImpl$SprintCacheSupplier.get(SprintManagerImpl.java:432)
com.atlassian.greenhopper.service.sprint.SprintManagerImpl$SprintCacheSupplier.get(SprintManagerImpl.java:425)
com.atlassian.cache.compat.delegate.DelegatingSupplier.get(DelegatingSupplier.java:22)
com.atlassian.cache.ehcache.EhCacheManager$SupplierAdapter.load(EhCacheManager.java:260)
...
com.atlassian.jira.cluster.cache.ehcache.BlockingParallelCacheReplicator.runDeferred(BlockingParallelCacheReplicator.java:172)
com.atlassian.jira.cache.DeferredReplicationCachedReference.get(DeferredReplicationCachedReference.java:28)
com.atlassian.cache.compat.delegate.DelegatingCachedReference.get(DelegatingCachedReference.java:22)
com.atlassian.greenhopper.service.sprint.SprintManagerImpl.getAllSprints(SprintManagerImpl.java:124)
com.atlassian.greenhopper.service.sprint.SprintManagerImpl.getSprints(SprintManagerImpl.java:132)
While threads that are trying to access the sprint cache will be parked at the following stack:
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0000000206cdbf10> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
at net.sf.ehcache.concurrent.ReadWriteLockSync.lock(ReadWriteLockSync.java:50)
at net.sf.ehcache.constructs.blocking.BlockingCache.acquiredLockForKey(BlockingCache.java:196)
at net.sf.ehcache.constructs.blocking.BlockingCache.get(BlockingCache.java:158)
at com.atlassian.cache.ehcache.LoadingCache.get(LoadingCache.java:79)
at com.atlassian.cache.ehcache.DelegatingCachedReference.get(DelegatingCachedReference.java:73)
at com.atlassian.jira.cache.DeferredReplicationCachedReference$$Lambda$265/1538681607.get(Unknown Source)
at com.atlassian.jira.cluster.cache.ehcache.BlockingParallelCacheReplicator.runDeferred(BlockingParallelCacheReplicator.java:172)
at com.atlassian.jira.cache.DeferredReplicationCachedReference.get(DeferredReplicationCachedReference.java:28)
at com.atlassian.cache.compat.delegate.DelegatingCachedReference.get(DelegatingCachedReference.java:22)
at com.atlassian.greenhopper.service.sprint.SprintManagerImpl.getSprint(SprintManagerImpl.java:110)
...
Notes
- Problem get amplified for Oracle DB. The reason for this is that by default, the Oracle JDBC driver fetch results of a SQL query streaming it from DB by 10 rows at a time. In a situation where a table has 150k records, 15k round trips will be required for the complete information of the table to reach the application. When the sprint cache gets flushed, a full table scan of the AO_60DB71_SPRINT table is performed to populate the sprint cache.
- Using an instance where 150k sprints exist as an example in Oracle DB , 15k rounds trips between the application and the database will be required for the entire data set to be retrieved by the thread waiting on this information - this translates to a total of 15 seconds if the latency between Jira and the DB server is at 1ms. An increased latency between the app server and the DB server will increase the time it takes for the full results of the query to reach the application exponentially.
Workaround