-
Bug
-
Resolution: Fixed
-
Low
-
7.2.2, 7.2.3, 7.2.7, 7.3.2
-
7.02
-
14
-
Severity 1 - Critical
-
134
-
Summary
JIRA populates activeUsersCountForLicense cache in inefficient way
Environment
- Large number of users: 10k+
// SQL to get all users select directory_id,count(1) from cwd_user group by directory_id; //SQL to get active users select directory_id,count(1) from cwd_user where active=1 group by directory_id;
Steps to Reproduce
- TBD
- Enable debug logging for com.atlassian.jira.application.DefaultApplicationRoleManager
Expected Results
JIRA populates the cache fast.
Actual Results
Cache population is slow, majority of the incoming http threads are waiting for getUserCount method (activeUsersCountForLicense cache) to be populated.
Enabling debug logging will show cache population time:
2017-03-10 12:16:16,700 http-nio-8080-exec-539 DEBUG user 736x6651698x30 1se 1.0.0.0 /secure/MyJiraHome.jspa [c.a.jira.application.DefaultApplicationRoleManager] JIRA Software count: 4800 2017-03-10 12:16:16,700 http-nio-8080-exec-539 DEBUG user 736x6651698x30 1se 1.0.0.0 /secure/MyJiraHome.jspa [c.a.jira.application.DefaultApplicationRoleManager] Count application users: 7.938 s JIRA Software
Notes
Note that JIRA with license JIRA Software (Server) Unlimited Users is not affected, as license check is always true.
(License is checked in 7.2.9 - JRASERVER-65462)
Threads waiting for getUserCount method:
grep -A 20 'http-nio' jira_threads.1488804924.txt | grep -A 19 'java.lang.Thread.State: WAITING' | grep -c 'com.atlassian.jira.application.DefaultApplicationRoleManager.getUserCount' 73
Example of the waiting thread (one of 73):
"http-nio-8080-exec-756" #365303 daemon prio=5 os_prio=0 tid=0x00007fda1801e000 nid=0x3664 waiting on condition [0x00007fde3abd3000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x000000055dce6220> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) .. at com.atlassian.cache.ehcache.LoadingCache.get(LoadingCache.java:75) at com.atlassian.cache.ehcache.DelegatingCache.get(DelegatingCache.java:97) at com.atlassian.jira.application.DefaultApplicationRoleManager.getUserCount(DefaultApplicationRoleManager.java:407) at com.atlassian.jira.application.DefaultApplicationRoleManager.isRoleLimitExceeded(DefaultApplicationRoleManager.java:258) at com.atlassian.jira.application.ApplicationAuthorizationServiceImpl.getLicensingExactErrors(ApplicationAuthorizationServiceImpl.java:135) at com.atlassian.jira.application.ApplicationAuthorizationServiceImpl.getLicensingAccessErrors(ApplicationAuthorizationServiceImpl.java:78) ... at com.atlassian.greenhopper.license.SoftwareAuthorizationServiceImpl.getSoftwareLicensingAccessErrors(SoftwareAuthorizationServiceImpl.java:76) at com.atlassian.greenhopper.license.SoftwareAuthorizationServiceImpl.verifySoftwareLicense(SoftwareAuthorizationServiceImpl.java:50) at com.pyxis.greenhopper.jira.license.GreenHopperLicenseManagerImpl.verify(GreenHopperLicenseManagerImpl.java:73) at com.pyxis.greenhopper.jira.license.GreenHopperLicenseManagerImpl.isValid(GreenHopperLicenseManagerImpl.java:94) at com.atlassian.greenhopper.license.LicenseServiceImpl.getLicenseInfo(LicenseServiceImpl.java:46) at com.atlassian.greenhopper.license.LicenseServiceImpl.isLicensed(LicenseServiceImpl.java:66) ...
This bug can be amplificated by JRA-64839. JIRA outage will be caused by combination of 2 factors:
- Too many cache flush events, see
JRA-64839- caused by allowing some user Application access
- Slow cache population, current ticket
- Those two thread will fight for cache, meantime all threads will wait (this instability could be 5+ min)
- one will flush
"http-nio-8080-exec-828" #365452 daemon prio=5 os_prio=0 tid=0x00007fda18083000 nid=0x38e0 waiting on condition [0x00007fde368df000] java.lang.Thread.State: WAITING (parking) ... at com.atlassian.cache.ehcache.LoadingCache.removeAll(LoadingCache.java:320) ... at com.atlassian.jira.application.DefaultApplicationRoleManager.clearUserCounts(DefaultApplicationRoleManager.java:651)
- one will flush
-
- another will populate:
"http-nio-8080-exec-831" #365455 daemon prio=5 os_prio=0 tid=0x00007fda540a8800 nid=0x38e3 runnable [0x00007fde36485000] java.lang.Thread.State: RUNNABLE .. at com.atlassian.jira.crowd.embedded.ofbiz.DelegatingUserDao.processUsers(DelegatingUserDao.java:58). ... at com.atlassian.cache.ehcache.LoadingCache.loadValueAndReleaseLock(LoadingCache.java:100) at com.atlassian.cache.ehcache.LoadingCache.get(LoadingCache.java:76) at com.atlassian.cache.ehcache.DelegatingCache.get(DelegatingCache.java:97) at com.atlassian.jira.application.DefaultApplicationRoleManager.getUserCount(DefaultApplicationRoleManager.java:407)
- another will populate:
Workaround
If possible reduce total number of users not used in JIRA (delete user or reduce LDAP sync scope)
JIRA uses following SQL to load users in processUsers method:
SELECT U.ID, U.directory_id, U.user_name, U.lower_user_name, U.active, U.created_date, U.updated_date, U.first_name, U.lower_first_name, U.last_name, U.lower_last_name, U.display_name, U.lower_display_name, U.email_address, U.lower_email_address, U.CREDENTIAL, U.deleted_externally, U.EXTERNAL_ID, D.active, D.directory_position FROM cwd_user U INNER JOIN cwd_directory D ON U.directory_id = D.ID WHERE D.active <> 0 ORDER BY U.lower_user_name, D.directory_position;
Note on Fix
We partially mitigated the problem by implementing JRASERVER-65352, this introduces timeout to the cache population operation and optional async loading.
We will monitor the problem and would review it again if it gets more traction.
- is related to
-
JRASERVER-61759 JIRA performance degradation during directory sync
- Closed
-
JRASERVER-61054 User count cache queries itself during loading
- Gathering Impact
-
JRASERVER-67614 JMX Monitoring may take a long time to load during Jira Startup
- Gathering Impact
- relates to
-
JRASERVER-69913 Finding membership for groups which has duplicated name can cause slowness access issues to certain pages of Jira instance
- Closed
-
JRASERVER-64839 JIRA generates cache flush event for each default group after enabling Application access
- Closed
-
JRASERVER-71231 Checking users against license tier affects product performance
- Closed
-
JRASERVER-73032 Altering group membership has performance impact on large instances
- Gathering Impact
-
JRASERVER-65352 Add activeUsersCountForLicense cache population timeout
- Closed
-
RUM-1767 Loading...
- mentioned in
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...