Uploaded image for project: 'Jira Data Center'
  1. Jira Data Center
  2. JRASERVER-64384

ActiveUsersCountForLicense cache is populated inefficiently leading to performance degradation across the application

    XMLWordPrintable

Details

    Description

      Summary

      JIRA populates activeUsersCountForLicense cache in inefficient way

      Environment

      • Large number of users: 10k+
        // SQL to get all users
        select directory_id,count(1) from cwd_user group by directory_id;    
        //SQL to get active users
        select directory_id,count(1) from cwd_user where active=1 group by directory_id;      
        

      Steps to Reproduce

      1. TBD
      2. Enable debug logging for com.atlassian.jira.application.DefaultApplicationRoleManager

      Expected Results

      JIRA populates the cache fast.

      Actual Results

      Cache population is slow, majority of the incoming http threads are waiting for getUserCount method (activeUsersCountForLicense cache) to be populated.

      Enabling debug logging will show cache population time:

      2017-03-10 12:16:16,700 http-nio-8080-exec-539 DEBUG user 736x6651698x30 1se 1.0.0.0 /secure/MyJiraHome.jspa [c.a.jira.application.DefaultApplicationRoleManager] JIRA Software count: 4800
      2017-03-10 12:16:16,700 http-nio-8080-exec-539 DEBUG user 736x6651698x30 1se 1.0.0.0 /secure/MyJiraHome.jspa [c.a.jira.application.DefaultApplicationRoleManager] Count application users: 7.938 s JIRA Software
      

      Notes

      Note that JIRA with license JIRA Software (Server) Unlimited Users is not affected, as license check is always true.
      (License is checked in 7.2.9 - JRASERVER-65462)

      Threads waiting for getUserCount method:

       grep -A 20 'http-nio' jira_threads.1488804924.txt | grep -A 19 'java.lang.Thread.State: WAITING' | grep -c 'com.atlassian.jira.application.DefaultApplicationRoleManager.getUserCount'
      73
      

      Example of the waiting thread (one of 73):

      "http-nio-8080-exec-756" #365303 daemon prio=5 os_prio=0 tid=0x00007fda1801e000 nid=0x3664 waiting on condition [0x00007fde3abd3000]
         java.lang.Thread.State: WAITING (parking)
      	at sun.misc.Unsafe.park(Native Method)
      	- parking to wait for  <0x000000055dce6220> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
      ..
      	at com.atlassian.cache.ehcache.LoadingCache.get(LoadingCache.java:75)
      	at com.atlassian.cache.ehcache.DelegatingCache.get(DelegatingCache.java:97)
      	at com.atlassian.jira.application.DefaultApplicationRoleManager.getUserCount(DefaultApplicationRoleManager.java:407)
      	at com.atlassian.jira.application.DefaultApplicationRoleManager.isRoleLimitExceeded(DefaultApplicationRoleManager.java:258)
      	at com.atlassian.jira.application.ApplicationAuthorizationServiceImpl.getLicensingExactErrors(ApplicationAuthorizationServiceImpl.java:135)
      	at com.atlassian.jira.application.ApplicationAuthorizationServiceImpl.getLicensingAccessErrors(ApplicationAuthorizationServiceImpl.java:78)
      ...
      	at com.atlassian.greenhopper.license.SoftwareAuthorizationServiceImpl.getSoftwareLicensingAccessErrors(SoftwareAuthorizationServiceImpl.java:76)
      	at com.atlassian.greenhopper.license.SoftwareAuthorizationServiceImpl.verifySoftwareLicense(SoftwareAuthorizationServiceImpl.java:50)
      	at com.pyxis.greenhopper.jira.license.GreenHopperLicenseManagerImpl.verify(GreenHopperLicenseManagerImpl.java:73)
      	at com.pyxis.greenhopper.jira.license.GreenHopperLicenseManagerImpl.isValid(GreenHopperLicenseManagerImpl.java:94)
      	at com.atlassian.greenhopper.license.LicenseServiceImpl.getLicenseInfo(LicenseServiceImpl.java:46)
      	at com.atlassian.greenhopper.license.LicenseServiceImpl.isLicensed(LicenseServiceImpl.java:66)
      ...
      

      This bug can be amplificated by JRA-64839. JIRA outage will be caused by combination of 2 factors:

      • Too many cache flush events, see JRA-64839
        • caused by allowing some user Application access
      • Slow cache population, current ticket
      • Those two thread will fight for cache, meantime all threads will wait (this instability could be 5+ min)
        • one will flush
          "http-nio-8080-exec-828" #365452 daemon prio=5 os_prio=0 tid=0x00007fda18083000 nid=0x38e0 waiting on condition [0x00007fde368df000]
             java.lang.Thread.State: WAITING (parking)
          ...
          	at com.atlassian.cache.ehcache.LoadingCache.removeAll(LoadingCache.java:320)
          ...
          	at com.atlassian.jira.application.DefaultApplicationRoleManager.clearUserCounts(DefaultApplicationRoleManager.java:651)
          
        • another will populate:
          "http-nio-8080-exec-831" #365455 daemon prio=5 os_prio=0 tid=0x00007fda540a8800 nid=0x38e3 runnable [0x00007fde36485000]
             java.lang.Thread.State: RUNNABLE
          ..
          	at com.atlassian.jira.crowd.embedded.ofbiz.DelegatingUserDao.processUsers(DelegatingUserDao.java:58).
          ...
          	at com.atlassian.cache.ehcache.LoadingCache.loadValueAndReleaseLock(LoadingCache.java:100)
          	at com.atlassian.cache.ehcache.LoadingCache.get(LoadingCache.java:76)
          	at com.atlassian.cache.ehcache.DelegatingCache.get(DelegatingCache.java:97)
          	at com.atlassian.jira.application.DefaultApplicationRoleManager.getUserCount(DefaultApplicationRoleManager.java:407)
          

      Workaround

      If possible reduce total number of users not used in JIRA (delete user or reduce LDAP sync scope)
      JIRA uses following SQL to load users in processUsers method:

      SELECT U.ID, U.directory_id, U.user_name, U.lower_user_name, U.active, U.created_date, U.updated_date, U.first_name, U.lower_first_name, U.last_name, U.lower_last_name, U.display_name, U.lower_display_name, U.email_address, U.lower_email_address, U.CREDENTIAL, U.deleted_externally, U.EXTERNAL_ID, D.active, D.directory_position FROM cwd_user U INNER JOIN cwd_directory D ON U.directory_id = D.ID WHERE D.active <> 0 ORDER BY U.lower_user_name, D.directory_position;
      

      Note on Fix

      We partially mitigated the problem by implementing JRASERVER-65352, this introduces timeout to the cache population operation and optional async loading.
      We will monitor the problem and would review it again if it gets more traction.
       
       

      Attachments

        Issue Links

          Activity

            People

              lwlodarczyk Lukasz Wlodarczyk
              ayakovlev@atlassian.com Andriy Yakovlev [Atlassian]
              Votes:
              4 Vote for this issue
              Watchers:
              24 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: