Loading...

Type: Bug
Resolution: Fixed
Priority: Low
Fix Version/s: 7.2.9, 7.3.7
Affects Version/s: 7.2.2, 7.2.3, 7.2.7, 7.3.2
Component/s: Licensing
Labels:

Introduced in Version:
7.02
Support reference count:
14
Symptom Severity:
Severity 1 - Critical
UIS:
134

Summary

JIRA populates activeUsersCountForLicense cache in inefficient way

Environment

Large number of users: 10k+

// SQL to get all users
select directory_id,count(1) from cwd_user group by directory_id;    
//SQL to get active users
select directory_id,count(1) from cwd_user where active=1 group by directory_id;

Steps to Reproduce

TBD
Enable debug logging for com.atlassian.jira.application.DefaultApplicationRoleManager

Expected Results

JIRA populates the cache fast.

Actual Results

Cache population is slow, majority of the incoming http threads are waiting for getUserCount method (activeUsersCountForLicense cache) to be populated.

Enabling debug logging will show cache population time:

2017-03-10 12:16:16,700 http-nio-8080-exec-539 DEBUG user 736x6651698x30 1se 1.0.0.0 /secure/MyJiraHome.jspa [c.a.jira.application.DefaultApplicationRoleManager] JIRA Software count: 4800
2017-03-10 12:16:16,700 http-nio-8080-exec-539 DEBUG user 736x6651698x30 1se 1.0.0.0 /secure/MyJiraHome.jspa [c.a.jira.application.DefaultApplicationRoleManager] Count application users: 7.938 s JIRA Software

Notes

Note that JIRA with license JIRA Software (Server) Unlimited Users is not affected, as license check is always true.
(License is checked in 7.2.9 - ~~JRASERVER-65462~~)

Threads waiting for getUserCount method:

 grep -A 20 'http-nio' jira_threads.1488804924.txt | grep -A 19 'java.lang.Thread.State: WAITING' | grep -c 'com.atlassian.jira.application.DefaultApplicationRoleManager.getUserCount'
73

Example of the waiting thread (one of 73):

"http-nio-8080-exec-756" #365303 daemon prio=5 os_prio=0 tid=0x00007fda1801e000 nid=0x3664 waiting on condition [0x00007fde3abd3000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x000000055dce6220> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
..
	at com.atlassian.cache.ehcache.LoadingCache.get(LoadingCache.java:75)
	at com.atlassian.cache.ehcache.DelegatingCache.get(DelegatingCache.java:97)
	at com.atlassian.jira.application.DefaultApplicationRoleManager.getUserCount(DefaultApplicationRoleManager.java:407)
	at com.atlassian.jira.application.DefaultApplicationRoleManager.isRoleLimitExceeded(DefaultApplicationRoleManager.java:258)
	at com.atlassian.jira.application.ApplicationAuthorizationServiceImpl.getLicensingExactErrors(ApplicationAuthorizationServiceImpl.java:135)
	at com.atlassian.jira.application.ApplicationAuthorizationServiceImpl.getLicensingAccessErrors(ApplicationAuthorizationServiceImpl.java:78)
...
	at com.atlassian.greenhopper.license.SoftwareAuthorizationServiceImpl.getSoftwareLicensingAccessErrors(SoftwareAuthorizationServiceImpl.java:76)
	at com.atlassian.greenhopper.license.SoftwareAuthorizationServiceImpl.verifySoftwareLicense(SoftwareAuthorizationServiceImpl.java:50)
	at com.pyxis.greenhopper.jira.license.GreenHopperLicenseManagerImpl.verify(GreenHopperLicenseManagerImpl.java:73)
	at com.pyxis.greenhopper.jira.license.GreenHopperLicenseManagerImpl.isValid(GreenHopperLicenseManagerImpl.java:94)
	at com.atlassian.greenhopper.license.LicenseServiceImpl.getLicenseInfo(LicenseServiceImpl.java:46)
	at com.atlassian.greenhopper.license.LicenseServiceImpl.isLicensed(LicenseServiceImpl.java:66)
...

This bug can be amplificated by ~~JRA-64839~~. JIRA outage will be caused by combination of 2 factors:

Too many cache flush events, see ~~JRA-64839~~
- caused by allowing some user Application access
Slow cache population, current ticket

Those two thread will fight for cache, meantime all threads will wait (this instability could be 5+ min)

one will flush

"http-nio-8080-exec-828" #365452 daemon prio=5 os_prio=0 tid=0x00007fda18083000 nid=0x38e0 waiting on condition [0x00007fde368df000]
   java.lang.Thread.State: WAITING (parking)
...
	at com.atlassian.cache.ehcache.LoadingCache.removeAll(LoadingCache.java:320)
...
	at com.atlassian.jira.application.DefaultApplicationRoleManager.clearUserCounts(DefaultApplicationRoleManager.java:651)

another will populate:

"http-nio-8080-exec-831" #365455 daemon prio=5 os_prio=0 tid=0x00007fda540a8800 nid=0x38e3 runnable [0x00007fde36485000]
   java.lang.Thread.State: RUNNABLE
..
	at com.atlassian.jira.crowd.embedded.ofbiz.DelegatingUserDao.processUsers(DelegatingUserDao.java:58).
...
	at com.atlassian.cache.ehcache.LoadingCache.loadValueAndReleaseLock(LoadingCache.java:100)
	at com.atlassian.cache.ehcache.LoadingCache.get(LoadingCache.java:76)
	at com.atlassian.cache.ehcache.DelegatingCache.get(DelegatingCache.java:97)
	at com.atlassian.jira.application.DefaultApplicationRoleManager.getUserCount(DefaultApplicationRoleManager.java:407)

Workaround

If possible reduce total number of users not used in JIRA (delete user or reduce LDAP sync scope)
JIRA uses following SQL to load users in processUsers method:

SELECT U.ID, U.directory_id, U.user_name, U.lower_user_name, U.active, U.created_date, U.updated_date, U.first_name, U.lower_first_name, U.last_name, U.lower_last_name, U.display_name, U.lower_display_name, U.email_address, U.lower_email_address, U.CREDENTIAL, U.deleted_externally, U.EXTERNAL_ID, D.active, D.directory_position FROM cwd_user U INNER JOIN cwd_directory D ON U.directory_id = D.ID WHERE D.active <> 0 ORDER BY U.lower_user_name, D.directory_position;

Note on Fix

We partially mitigated the problem by implementing ~~JRASERVER-65352~~, this introduces timeout to the cache population operation and optional async loading.
We will monitor the problem and would review it again if it gets more traction.

is related to

JRASERVER-61054 User count cache queries itself during loading

Closed

JRASERVER-61759 JIRA performance degradation during directory sync

Closed

JRASERVER-67614 JMX Monitoring may take a long time to load during Jira Startup

Gathering Impact

relates to

JRASERVER-69913 Finding membership for groups which has duplicated name can cause slowness access issues to certain pages of Jira instance

Closed

JRASERVER-64839 JIRA generates cache flush event for each default group after enabling Application access

Closed

JRASERVER-71231 Checking users against license tier affects product performance

Closed

JRASERVER-73032 Altering group membership has performance impact on large instances

Gathering Impact

JRASERVER-65352 Add activeUsersCountForLicense cache population timeout

Closed

RUM-1767 Loading...

mentioned in: Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...

(4 relates to, 13 mentioned in)

Details

Description

Summary

Environment

Steps to Reproduce

Expected Results

Actual Results

Notes

Workaround

Note on Fix

Attachments

Issue Links

Activity

People

Dates