-
Bug
-
Resolution: Fixed
-
High
-
7.6.9, 8.5.3, 7.13.18, 8.13.1, 8.13.4, 8.18.2
-
7.06
-
15
-
Severity 2 - Major
-
30
-
The main fix for this problem was delivered in 8.19.1, which removed the contention caused by mentions. The contention caused by the user login action was removed in 8.22.0/8.20.8. See JRASERVER-70468 for more information.
Issue Summary
Mostly at the start of the day (when Jira is under heavy login pressure from users), the customer experienced severe performance degradation (to the point that the system appeared unresponsive). After digging into the thread dumps, we found two sets of problems:
- A large number of threads waiting for a ReadLock (action - /rest/internal/2/user/mention/search)
- At the same time, a group of threads waiting for a WriteLock (action - /plugins/servlet/samlconsumer)
both in DirectoryUserIndexer.java.
This is caused by lock contention in the code. Many threads (waiting for a ReadLock) are queued behind one thread that needs a WriteLock to update the UserIndex. That thread cannot proceed because it is itself waiting on another thread that currently holds a ReadLock on the UserIndex while running a search.
This causes cascading contention and degrades user experience in Jira for other actions (see JSWSERVER-20336).
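The lock pattern described above can be sketched in a few lines of Java. This is a minimal illustration, not Jira's code: the class `UserIndexContentionSketch`, its methods, and the latches are hypothetical names chosen for the example. It assumes, based on the thread dumps in this issue, that updates run inside a `synchronized` block and then take the write lock of a `ReentrantReadWriteLock`, while searches hold the read lock for their whole duration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the user index; all names are illustrative.
class UserIndexContentionSketch {
    private final Object updateMonitor = new Object();
    private final ReentrantReadWriteLock indexLock = new ReentrantReadWriteLock();

    // A search holds the read lock until it finishes.
    void search(CountDownLatch started, CountDownLatch finish) throws InterruptedException {
        indexLock.readLock().lock();
        try {
            started.countDown();
            finish.await(); // simulates a slow search over a large user index
        } finally {
            indexLock.readLock().unlock();
        }
    }

    // A login-triggered update takes the monitor first, then needs the write lock.
    void update(CountDownLatch insideMonitor) {
        synchronized (updateMonitor) {
            insideMonitor.countDown();
            indexLock.writeLock().lock(); // parks while any reader holds the lock
            try {
                // the real code would refresh the searcher here
            } finally {
                indexLock.writeLock().unlock();
            }
        }
    }

    // Polls a thread's state for up to five seconds.
    private static boolean awaitState(Thread t, Thread.State expected) throws InterruptedException {
        long deadline = System.nanoTime() + 5_000_000_000L;
        while (t.getState() != expected) {
            if (System.nanoTime() > deadline) {
                return false;
            }
            Thread.sleep(10);
        }
        return true;
    }

    // Returns true if the first login parks on the write lock and the second
    // login is BLOCKED on the monitor while one search is still running.
    static boolean demonstrate() {
        UserIndexContentionSketch index = new UserIndexContentionSketch();
        CountDownLatch searchStarted = new CountDownLatch(1);
        CountDownLatch finishSearch = new CountDownLatch(1);
        CountDownLatch insideMonitor = new CountDownLatch(1);
        Thread searcher = new Thread(() -> {
            try {
                index.search(searchStarted, finishSearch);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "mention-search");
        Thread login1 = new Thread(() -> index.update(insideMonitor), "login-1");
        Thread login2 = new Thread(() -> index.update(insideMonitor), "login-2");
        searcher.setDaemon(true);
        login1.setDaemon(true);
        login2.setDaemon(true);
        try {
            searcher.start();
            searchStarted.await();
            login1.start();
            insideMonitor.await();
            boolean writerParked = awaitState(login1, Thread.State.WAITING);
            login2.start();
            boolean secondLoginBlocked = awaitState(login2, Thread.State.BLOCKED);
            finishSearch.countDown(); // let the search finish so everything drains
            searcher.join();
            login1.join();
            login2.join();
            return writerParked && secondLoginBlocked;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            finishSearch.countDown();
        }
    }

    public static void main(String[] args) {
        System.out.println("contention reproduced: " + demonstrate());
    }
}
```

In this sketch a single slow reader is enough to stall every later update. In addition, the nonfair `ReentrantReadWriteLock` makes a newly arriving reader queue when a writer is waiting at the head of the queue, which would be consistent with the mention-search threads piling up behind the parked writer.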
Steps to Reproduce
Details from the customer instance:
- 300k+ users
- A massive number of users logging in simultaneously at the start of the day
- Many actions attempting to @mention users at the same time
We have not attempted to reproduce this in-house.
Expected Results
Jira handles the start-of-the-day login pressure without hiccups
Actual Results
- High load average (screenshot: 2020-11-25_20-07-42.png)
- BLOCKED threads at com.atlassian.crowd.directory.DbCachingRemoteDirectory.updateUserAndSetActiveFlag
    thread_dump_9.txt 85
    thread_dump_10.txt 84
    thread_dump_11.txt 85
    thread_dump_12.txt 84
    thread_dump_13.txt 84
    thread_dump_14.txt 85
- These threads are blocked by one WAITING thread which needs to get a WriteLock at com.atlassian.jira.bc.user.search.DirectoryUserIndexer.refreshSearcher(DirectoryUserIndexer.java:127)
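Per-dump counts like the ones above can be produced with a small helper. The sketch below is one possible way to do it, not necessarily how the numbers in this issue were obtained. The class `BlockedThreadCounter` and its method are hypothetical, and the code assumes HotSpot-style dumps in which thread entries are separated by blank lines.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper; assumes thread entries are separated by blank lines.
class BlockedThreadCounter {

    // Counts thread entries that are BLOCKED and contain the given frame.
    static int countBlockedAt(String dump, String frame) {
        int count = 0;
        for (String entry : dump.split("\\R\\s*\\R")) {
            if (entry.contains("java.lang.Thread.State: BLOCKED") && entry.contains(frame)) {
                count++;
            }
        }
        return count;
    }

    // Prints "<file> <count>" for every dump file given on the command line.
    public static void main(String[] args) throws IOException {
        String frame = "com.atlassian.crowd.directory.DbCachingRemoteDirectory.updateUserAndSetActiveFlag";
        for (String file : args) {
            String dump = new String(Files.readAllBytes(Paths.get(file)), StandardCharsets.UTF_8);
            System.out.println(file + " " + countBlockedAt(dump, frame));
        }
    }
}
```

Running it as `java BlockedThreadCounter thread_dump_9.txt thread_dump_10.txt` would print one line per file in the same `name count` layout as the list above.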
Notes
Related thread dumps.
- Synchronised Threads waiting during authentication:
"http-nio-9080-exec-637" #261926 daemon prio=5 os_prio=0 tid=0x00007fb344586ef0 nid=0x5df0 waiting for monitor entry [0x00007fb28d24b000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at com.atlassian.jira.crowd.embedded.ofbiz.IndexedUserDao.update(IndexedUserDao.java:318)
    - waiting to lock <0x0000000145c5e268> (a java.lang.Object)
    at com.atlassian.jira.crowd.embedded.ofbiz.DelegatingUserDao.update(DelegatingUserDao.java:98)
    at com.atlassian.jira.crowd.embedded.ofbiz.SwitchingUserDao.update(SwitchingUserDao.java:30)
    at com.atlassian.crowd.directory.CachingDirectory.updateUser(CachingDirectory.java:142)
    at com.atlassian.crowd.directory.DbCachingRemoteDirectory.updateUserAndSetActiveFlag(DbCachingRemoteDirectory.java:324)
    at com.atlassian.crowd.directory.DbCachingRemoteDirectory.updateUserFromRemoteDirectory(DbCachingRemoteDirectory.java:266)
    at com.atlassian.crowd.directory.RemoteDirectory.userAuthenticated(RemoteDirectory.java:592)
    at com.atlassian.crowd.directory.DbCachingRemoteDirectory.userAuthenticated(DbCachingRemoteDirectory.java:281)
    at com.atlassian.crowd.manager.directory.DirectoryManagerGeneric.userAuthenticated(DirectoryManagerGeneric.java:278)
    at com.atlassian.crowd.manager.application.ApplicationServiceGeneric.userAuthenticated(ApplicationServiceGeneric.java:2357)
    at com.atlassian.crowd.embedded.core.CrowdServiceImpl.userAuthenticated(CrowdServiceImpl.java:109)
    ...
- Thread which waits for WriteLock at DirectoryUserIndexer.refreshSearcher
"http-nio-9080-exec-611" #256200 daemon prio=5 os_prio=0 tid=0x00007fb3503bfaf0 nid=0xe398 waiting on condition [0x00007fb28ce47000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x0000000145c5e1e8> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
    at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
    at com.atlassian.jira.bc.user.search.DirectoryUserIndexer.refreshSearcher(DirectoryUserIndexer.java:127)
    at com.atlassian.jira.bc.user.search.DirectoryUserIndexer.index(DirectoryUserIndexer.java:159)
    at com.atlassian.jira.crowd.embedded.ofbiz.IndexedUserDao.update(IndexedUserDao.java:320)
    - locked <0x0000000145c5e268> (a java.lang.Object)
    at com.atlassian.jira.crowd.embedded.ofbiz.DelegatingUserDao.update(DelegatingUserDao.java:98)
    at com.atlassian.jira.crowd.embedded.ofbiz.SwitchingUserDao.update(SwitchingUserDao.java:30)
    at com.atlassian.crowd.directory.CachingDirectory.updateUser(CachingDirectory.java:142)
    at com.atlassian.crowd.directory.DbCachingRemoteDirectory.updateUserAndSetActiveFlag(DbCachingRemoteDirectory.java:324)
    at com.atlassian.crowd.directory.DbCachingRemoteDirectory.updateUserFromRemoteDirectory(DbCachingRemoteDirectory.java:266)
    at com.atlassian.crowd.directory.RemoteDirectory.userAuthenticated(RemoteDirectory.java:592)
    at com.atlassian.crowd.directory.DbCachingRemoteDirectory.userAuthenticated(DbCachingRemoteDirectory.java:281)
    at com.atlassian.crowd.manager.directory.DirectoryManagerGeneric.userAuthenticated(DirectoryManagerGeneric.java:278)
    at com.atlassian.crowd.manager.application.ApplicationServiceGeneric.userAuthenticated(ApplicationServiceGeneric.java:2357)
    at com.atlassian.crowd.embedded.core.CrowdServiceImpl.userAuthenticated(CrowdServiceImpl.java:109)
    [...]
- RUNNABLE thread holding the ReadLock and executing the search
"http-nio-9080-exec-635" #261922 daemon prio=5 os_prio=0 tid=0x00007fb35030d5f0 nid=0x5dec runnable [0x00007fb28e3ab000]
   java.lang.Thread.State: RUNNABLE
    at org.apache.lucene.store.RAMFile.numBuffers(RAMFile.java:68)
    - locked <0x00000001d3000040> (a org.apache.lucene.store.RAMFile)
    at org.apache.lucene.store.RAMInputStream.setCurrentBuffer(RAMInputStream.java:125)
    at org.apache.lucene.store.RAMInputStream.nextBuffer(RAMInputStream.java:119)
    ....
    at com.atlassian.jira.bc.user.search.DirectoryUserIndexer.internalSearch(DirectoryUserIndexer.java:262)
    at com.atlassian.jira.bc.user.search.DirectoryUserIndexer.search(DirectoryUserIndexer.java:228)
    ...
    at com.atlassian.jira.crowd.embedded.ofbiz.IndexedUserDao.trySearching(IndexedUserDao.java:426)
    at com.atlassian.jira.crowd.embedded.ofbiz.IndexedUserDao.search(IndexedUserDao.java:417)
    ...
- Note: DirectoryUserIndexer.internalSearch
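The link between the BLOCKED threads and the thread that holds their monitor is visible in the dumps through the shared monitor address (`<0x0000000145c5e268>` in the excerpts above). The following sketch automates that matching. `MonitorOwnerFinder` is a hypothetical helper that assumes HotSpot-style dumps with blank lines between thread entries. It only follows `- locked` / `- waiting to lock` monitor lines, so `ReentrantReadWriteLock` waits (reported as `parking to wait for`) are out of its scope.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: groups threads blocked on a monitor under the thread holding it.
class MonitorOwnerFinder {
    private static final Pattern NAME = Pattern.compile("\"([^\"]+)\"");
    private static final Pattern LOCKED = Pattern.compile("- locked <(0x[0-9a-fA-F]+)>");
    private static final Pattern WAITING = Pattern.compile("- waiting to lock <(0x[0-9a-fA-F]+)>");

    // Returns a map from holder thread name to the names of threads waiting on its monitors.
    static Map<String, List<String>> waitersByOwner(String dump) {
        Map<String, String> ownerOf = new HashMap<>();                // monitor address -> holder
        Map<String, List<String>> waitersOf = new LinkedHashMap<>();  // monitor address -> waiters
        for (String entry : dump.split("\\R\\s*\\R")) {
            Matcher name = NAME.matcher(entry);
            if (!name.find()) {
                continue; // not a thread entry
            }
            String thread = name.group(1);
            Matcher locked = LOCKED.matcher(entry);
            while (locked.find()) {
                ownerOf.put(locked.group(1), thread);
            }
            Matcher waiting = WAITING.matcher(entry);
            while (waiting.find()) {
                waitersOf.computeIfAbsent(waiting.group(1), k -> new ArrayList<>()).add(thread);
            }
        }
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : waitersOf.entrySet()) {
            String owner = ownerOf.get(e.getKey());
            if (owner != null) {
                result.computeIfAbsent(owner, k -> new ArrayList<>()).addAll(e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String dump = String.join("\n",
            "\"http-nio-9080-exec-637\" #261926 daemon waiting for monitor entry",
            "    - waiting to lock <0x0000000145c5e268> (a java.lang.Object)",
            "",
            "\"http-nio-9080-exec-611\" #256200 daemon waiting on condition",
            "    - locked <0x0000000145c5e268> (a java.lang.Object)");
        System.out.println(waitersByOwner(dump));
    }
}
```

Applied to a full dump from this incident, a single holder with dozens of waiters would point at the one thread parked in `refreshSearcher` while holding the `IndexedUserDao.update` monitor.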
Workaround
Since the problem is caused by a number of factors, we suggest a multi-factor approach to reduce it:
- Decrease the number of users in the system (if possible) to reduce the cost of computing @mention
- Increase CPU cores to reduce the time spent holding the lock
- Identify bot user accounts creating unnatural login pressure
- Increase the session timeout or remove bot-killer to reduce how often users need to log in/authenticate
- is related to: JSWSERVER-20336 Searching and Mentioning users may cause performance issues and high CPU load (Closed)
Form Name |
---|
[JSWSERVER-20844] Severe performance degradation for user mentions and user login action due to contention in IndexedUserDao
Remote Link | Original: This issue links to "Page (Atlassian Documentation)" [ 610533 ] |
Remote Link | Original: This issue links to "PSR-601 (Bulldog)" [ 533886 ] | New: This issue links to "PSR-601 (JIRA Server (Bulldog))" [ 533886 ] |
Remote Link | Original: This issue links to "LOGIN-1 (Bulldog)" [ 594402 ] | New: This issue links to "LOGIN-1 (JIRA Server (Bulldog))" [ 594402 ] |
Remote Link | New: This issue links to "Page (Confluence)" [ 674833 ] |
Description | Original: "Note that in 8.22.0 we have removed the contention caused by user login action." | New: "Note that in 8.22.0/8.20.8 we have removed the contention caused by user login action." (the rest of the description is unchanged) |
Remote Link | New: This issue links to "Page (Confluence)" [ 628580 ] |
Description |
Original: the description without the {panel:title=Fix} block (begins at "h3. Issue Summary") |
New:
{panel:title=Fix|bgColor=#FFFFCE}
The main fix for this problem was delivered in 8.19.1, where we got rid of the contention caused by mentions. Note that in 8.22.0 we removed the contention caused by the user login action. See [
{panel}

h3. Issue Summary

Mostly at the start of the day (when Jira is under heavy login pressure from users), the customer experienced severe performance degradation (up to the point that the system looked unresponsive). Upon digging into thread dumps, we found 2 sets of problems:
# A large number of threads waiting for a *ReadLock* (action - /rest/internal/2/user/mention/search)
# At the same time, a group of threads waiting for a *WriteLock* (action - /plugins/servlet/samlconsumer)

both on {{DirectoryUserIndexer.java}}.

This is caused by contention in the code and leads to a situation where a lot of threads (waiting for a ReadLock) are waiting on one thread that is supposed to get a WriteLock to update the {{UserIndex}}, but cannot, because it in turn waits on another thread that currently holds a ReadLock on the {{UserIndex}} and is running the search.

This causes cascading contention and degrades the user experience in Jira for other actions (see

h3. Steps to Reproduce

Details from the customer instance:
* 300k+ users
* Massive number of users logging in simultaneously at the start of the day
* Lots of actions attempting to {{@mention}} users at the same time

Have not attempted to repro in-house.

h3. Expected Results

Jira handles the start-of-the-day login pressure without hiccups

h3. Actual Results

* High load average: !2020-11-25_20-07-42.png|thumbnail!
* BLOCKED threads at {{com.atlassian.crowd.directory.DbCachingRemoteDirectory.updateUserAndSetActiveFlag}}
{code}
thread_dump_9.txt 85
thread_dump_10.txt 84
thread_dump_11.txt 85
thread_dump_12.txt 84
thread_dump_13.txt 84
thread_dump_14.txt 85
{code}
* These threads are blocked by one WAITING thread which needs to get a WriteLock at com.atlassian.jira.bc.user.search.DirectoryUserIndexer.refreshSearcher(DirectoryUserIndexer.java:127)

h3. Notes

Related thread dumps.
* *Synchronised* Threads waiting during authentication:
{noformat}
"http-nio-9080-exec-637" #261926 daemon prio=5 os_prio=0 tid=0x00007fb344586ef0 nid=0x5df0 waiting for monitor entry [0x00007fb28d24b000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at com.atlassian.jira.crowd.embedded.ofbiz.IndexedUserDao.update(IndexedUserDao.java:318)
    - waiting to lock <0x0000000145c5e268> (a java.lang.Object)
    at com.atlassian.jira.crowd.embedded.ofbiz.DelegatingUserDao.update(DelegatingUserDao.java:98)
    at com.atlassian.jira.crowd.embedded.ofbiz.SwitchingUserDao.update(SwitchingUserDao.java:30)
    at com.atlassian.crowd.directory.CachingDirectory.updateUser(CachingDirectory.java:142)
    at com.atlassian.crowd.directory.DbCachingRemoteDirectory.updateUserAndSetActiveFlag(DbCachingRemoteDirectory.java:324)
    at com.atlassian.crowd.directory.DbCachingRemoteDirectory.updateUserFromRemoteDirectory(DbCachingRemoteDirectory.java:266)
    at com.atlassian.crowd.directory.RemoteDirectory.userAuthenticated(RemoteDirectory.java:592)
    at com.atlassian.crowd.directory.DbCachingRemoteDirectory.userAuthenticated(DbCachingRemoteDirectory.java:281)
    at com.atlassian.crowd.manager.directory.DirectoryManagerGeneric.userAuthenticated(DirectoryManagerGeneric.java:278)
    at com.atlassian.crowd.manager.application.ApplicationServiceGeneric.userAuthenticated(ApplicationServiceGeneric.java:2357)
    at com.atlassian.crowd.embedded.core.CrowdServiceImpl.userAuthenticated(CrowdServiceImpl.java:109)
    ...
{noformat}
* Thread which waits for *WriteLock* at {{DirectoryUserIndexer.refreshSearcher}}
{noformat}
"http-nio-9080-exec-611" #256200 daemon prio=5 os_prio=0 tid=0x00007fb3503bfaf0 nid=0xe398 waiting on condition [0x00007fb28ce47000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x0000000145c5e1e8> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
    at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
    at com.atlassian.jira.bc.user.search.DirectoryUserIndexer.refreshSearcher(DirectoryUserIndexer.java:127)
    at com.atlassian.jira.bc.user.search.DirectoryUserIndexer.index(DirectoryUserIndexer.java:159)
    at com.atlassian.jira.crowd.embedded.ofbiz.IndexedUserDao.update(IndexedUserDao.java:320)
    - locked <0x0000000145c5e268> (a java.lang.Object)
    at com.atlassian.jira.crowd.embedded.ofbiz.DelegatingUserDao.update(DelegatingUserDao.java:98)
    at com.atlassian.jira.crowd.embedded.ofbiz.SwitchingUserDao.update(SwitchingUserDao.java:30)
    at com.atlassian.crowd.directory.CachingDirectory.updateUser(CachingDirectory.java:142)
    at com.atlassian.crowd.directory.DbCachingRemoteDirectory.updateUserAndSetActiveFlag(DbCachingRemoteDirectory.java:324)
    at com.atlassian.crowd.directory.DbCachingRemoteDirectory.updateUserFromRemoteDirectory(DbCachingRemoteDirectory.java:266)
    at com.atlassian.crowd.directory.RemoteDirectory.userAuthenticated(RemoteDirectory.java:592)
    at com.atlassian.crowd.directory.DbCachingRemoteDirectory.userAuthenticated(DbCachingRemoteDirectory.java:281)
    at com.atlassian.crowd.manager.directory.DirectoryManagerGeneric.userAuthenticated(DirectoryManagerGeneric.java:278)
    at com.atlassian.crowd.manager.application.ApplicationServiceGeneric.userAuthenticated(ApplicationServiceGeneric.java:2357)
    at com.atlassian.crowd.embedded.core.CrowdServiceImpl.userAuthenticated(CrowdServiceImpl.java:109)
    [...]
{noformat}
* RUNNABLE thread holding the *ReadLock* and executing the search
{noformat}
"http-nio-9080-exec-635" #261922 daemon prio=5 os_prio=0 tid=0x00007fb35030d5f0 nid=0x5dec runnable [0x00007fb28e3ab000]
   java.lang.Thread.State: RUNNABLE
    at org.apache.lucene.store.RAMFile.numBuffers(RAMFile.java:68)
    - locked <0x00000001d3000040> (a org.apache.lucene.store.RAMFile)
    at org.apache.lucene.store.RAMInputStream.setCurrentBuffer(RAMInputStream.java:125)
    at org.apache.lucene.store.RAMInputStream.nextBuffer(RAMInputStream.java:119)
    ....
    at com.atlassian.jira.bc.user.search.DirectoryUserIndexer.internalSearch(DirectoryUserIndexer.java:262)
    at com.atlassian.jira.bc.user.search.DirectoryUserIndexer.search(DirectoryUserIndexer.java:228)
    ...
    at com.atlassian.jira.crowd.embedded.ofbiz.IndexedUserDao.trySearching(IndexedUserDao.java:426)
    at com.atlassian.jira.crowd.embedded.ofbiz.IndexedUserDao.search(IndexedUserDao.java:417)
    ...
{noformat}
** Note: {{DirectoryUserIndexer.internalSearch}}

h3. Workaround

Since the problem is caused by a number of factors, we suggest a multi-factor approach to reduce it:
* Decrease the number of users in the system (if possible), to reduce the cost of computing {{@mention}}
* Increase CPU cores to reduce the time spent in the lock
* Identify bot user accounts creating unnatural login pressure
* Increase the session timeout or remove bot-killer to reduce how often users need to log in/authenticate |
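The lock interaction described in the Issue Summary can be reproduced outside Jira with a small Java sketch. This is not Jira's code: the class, method and thread names are hypothetical. The only detail taken from the thread dumps is the non-fair {{ReentrantReadWriteLock}}. It relies on OpenJDK's behaviour where a new reader queues behind a writer that is already first in the wait queue, so treat it as an illustration under that assumption rather than a description of {{DirectoryUserIndexer}}.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration only; none of these names exist in Jira.
class ReadWriteContentionSketch {

    /**
     * Returns true if a late reader could not get the read lock within 200 ms
     * while a writer was parked behind a long-running reader.
     */
    static boolean lateReaderIsBlocked() {
        // The default constructor creates a non-fair lock (NonfairSync in the thread dump).
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        CountDownLatch searchHoldsReadLock = new CountDownLatch(1);
        CountDownLatch finishSearch = new CountDownLatch(1);

        // Stands in for the RUNNABLE thread running a slow mention search.
        Thread slowSearch = new Thread(() -> {
            lock.readLock().lock();
            try {
                searchHoldsReadLock.countDown();
                finishSearch.await(); // keep the read lock until told to finish
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.readLock().unlock();
            }
        }, "slow-mention-search");

        // Stands in for the login thread that needs the write lock to refresh the index.
        Thread loginUpdate = new Thread(() -> {
            lock.writeLock().lock();
            lock.writeLock().unlock();
        }, "login-index-update");

        try {
            slowSearch.start();
            searchHoldsReadLock.await();
            loginUpdate.start();
            // Wait until the writer is parked ("WAITING (parking)" in a thread dump).
            while (loginUpdate.getState() != Thread.State.WAITING) {
                Thread.sleep(1);
            }
            // A new search arrives: it queues behind the parked writer and times out,
            // even though only a read lock is currently held.
            boolean acquired = lock.readLock().tryLock(200, TimeUnit.MILLISECONDS);
            if (acquired) {
                lock.readLock().unlock();
            }
            finishSearch.countDown(); // let the slow search end so the writer can proceed
            slowSearch.join();
            loginUpdate.join();
            return !acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("late reader blocked: " + lateReaderIsBlocked());
    }
}
```

In the sketch a single slow reader is enough to stall both the writer and every reader that arrives after it, which matches the pile-up seen in the thread dumps: the longer each search holds the read lock, the longer logins and later searches wait.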
Remote Link | New: This issue links to "Page (Confluence)" [ 611899 ] |
Labels | Original: deltaUserCache deltaUserLogin performance performance-scalability pse-request | New: deltaUserCache deltaUserLogin fixedByDelta performance performance-scalability pse-request |
Remote Link | New: This issue links to "Page (Atlassian Documentation)" [ 610533 ] |