Uploaded image for project: 'Jira Data Center'
  1. Jira Data Center
  2. JRASERVER-70690

User login fails because inconsistent data / uk_mem_parent_child_type

    XMLWordPrintable

Details

    Description

      Issue Summary

      Data inconsistency result in users not able to login. There is a number of related cases when for all of them the root cause isn’t found, for most of them DB manipulation/refreshing cache/re-adding directory and syncing it solved the problem.

      This results in users unable to login, but flows/actions that could result in this state are plenty of. Mainly those are user login, directory Synchronisation.

      For some customers that works only temporary so the workaround is done over and over again.

      Inconsistency exists in DB between cwd_membership and cwd_user tables. 
      The most probable reason of it is cache corruption. It was confirmed in 1  case.

      There is a KnowledgeBase article for that, number of cases with similar symptoms and the same workaround steps.
      https://confluence.atlassian.com/jirakb/ldap-users-and-groups-display-unexpectedly-in-jira-server-776799341.html

      The biggest problem with pointing out the root cause is that problem show off with the delay(like a couple of months later), when there are no available logs providing information how data inconsistency was created (due to failed CRUD operation, or cache corruption -> that's why there are different workaround - manual DB update / new user-directory + resync & cache refresh)

      Steps to Reproduce

      1. Problem is not yet reproducible
      2. On affected/corrupted instances users are not able to login

      Expected Results

      An active user is able to login. If a problem occurs, sync AD with Jira solves the problem.
      -> There is no data inconsistency

      Actual Results

      The below exception is thrown in the Atlassian-jira.log file:

      ...
      2019-11-11 21:10:26,238 http-nio-9280-exec-17 ERROR      [c.a.j.web.servlet.InternalServerErrorServlet] {errorId=a0e52747-b31d-4200-a2ab-5da0eedddf5d, interpretedMsg=, cause=com.atlassian.jira.exception.DataAccessException: org.ofbiz.core.entity.GenericEntityException: 
      while inserting: [GenericEntity:Membership][lowerChildName,atariq][membershipType,GROUP_USER][parentName,SG-NETACCESS-CORP][childName,atariq][directoryId,10000][id,571581][childId,78974][lowerParentName,sg-netaccess-corp][parentId,14392] 
      (SQL Exception while executing the following:INSERT INTO cwd_membership (ID, parent_id, child_id, membership_type, group_type, parent_name, lower_parent_name, child_name, lower_child_name, directory_id) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?) (Duplicate entry '14392-78974-GROUP_USER' for key 'uk_mem_parent_child_type')), 
      stacktrace=com.atlassian.jira.exception.DataAccessException: org.ofbiz.core.entity.GenericEntityException: while inserting: [GenericEntity:Membership][lowerChildName,atariq][membershipType,GROUP_USER][parentName,SG-NETACCESS-CORP][childName,atariq][directoryId,10000][id,571581][childId,78974][lowerParentName,sg-netaccess-corp][parentId,14392] (SQL Exception while executing the following:INSERT INTO cwd_membership (ID, parent_id, child_id, membership_type, group_type, parent_name, lower_parent_name, child_name, lower_child_name, directory_id) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?) (Duplicate entry '14392-78974-GROUP_USER' for key 'uk_mem_parent_child_type'))
      

      Notes

      We have reports from a group of customers that suggests that problem was caused by staging/testing Jira instance sending the cache updates to the production cluster. In other words, production data in cwd_membership cache got poisoned by data from another environment.

      Workaround

      • Refresh caches by full Jira restart (all nodes in case of DC)
        • Full restart in Data Center is required because nodes with a corrupted cache will replicate it to other starting nodes, this means shutting down all nodes, validating the service is stopped and starting them back up, a rolling restart will not resolve this issue.
      • Add new directory, sync it, remove the 1st one
        • Create a new user directory with exact same settings as Directory ID: 10000 (name: Active Directory server
        • Synchronize with this new directory
        • Move this new directory to top position (the first directory to be checked)
        • Test user access
        • Remove old directory(id:10000)
      • Remove blocking(inconsistent data) user from cwd_membership and consequently update cwd_user: 
        We should be able to see if the DB is affected by running: 
        -- users
        select distinct M.lower_child_name
        from cwd_membership M
        join cwd_user U 
          on U.lower_user_name = M.lower_child_name 
          and U.directory_id = M.directory_id 
          and M.membership_type = 'GROUP_USER'
        where M.child_id != U.id;
        
        -- groups
        select distinct M.lower_child_name
        from cwd_membership M 
        join cwd_group G
          on G.lower_group_name = M.lower_child_name 
          and G.directory_id = M.directory_id 
          and M.membership_type = 'GROUP_GROUP'
        where M.child_id != G.id; 

        Following updates will remove any inconsistent data. Always have a backup of your database before doing any changes to it

        -- For Postgres
        DROP INDEX UK_MEM_PARENT_CHILD_TYPE;
        
        update cwd_membership
        set child_id = U.id
        from cwd_user U
        where cwd_membership.lower_child_name = U.lower_user_name
        and cwd_membership.directory_id = U.directory_id
        and cwd_membership.membership_type = 'GROUP_USER'
        and child_id != U.id;
        
        update cwd_membership
        set child_id = G.id
        from cwd_group G
        where cwd_membership.lower_child_name = G.lower_group_name
        and cwd_membership.directory_id = G.directory_id
        and cwd_membership.membership_type = 'GROUP_GROUP'
        and child_id != G.id;
        
        update cwd_membership
        set parent_id = G.id
        from cwd_group G
        where cwd_membership.lower_parent_name = G.lower_group_name
        and cwd_membership.directory_id = G.directory_id
        and parent_id != G.id;
        
        delete from cwd_membership 
        where lower_child_name not in
        (
          select lower_user_name 
          from cwd_user
        )
        and membership_type = 'GROUP_USER';
        
        delete from cwd_membership 
        where lower_child_name not in
        (
          select lower_group_name
          from cwd_group
        )
        and membership_type = 'GROUP_GROUP';
        
        delete from cwd_membership 
        where lower_parent_name not in 
        (
          select lower_group_name
          from cwd_group
        );
        
        create unique index uk_mem_parent_child_type
        on public.cwd_membership 
        using
        btree (parent_id, child_id, membership_type)
        

      A cold restart is needed to rebuilt the cache with the correct data.

      Attachments

        Issue Links

          Activity

            People

              mswinarski Maciej Swinarski (Inactive)
              afaruga@atlassian.com Artur Faruga
              Votes:
              25 Vote for this issue
              Watchers:
              60 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: