Release lock for User Directory synchronisation process in cwd_cluster_lock table if assigned to an invalid node ID

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Low
    • Affects Version/s: 5.3.2
    • Component/s: Directories
    • Severity 3 - Minor

      Issue Summary

      When a node fails to release the lock for a directory synchronisation that it started, the synchronisation process can continue running in the background. This can occur if the node holding the lock becomes non-operational or if its node ID changes.

      As a result, the synchronisation remains active but cannot complete properly, and the following messages appear in the application logs:

      2024-07-12 15:10:00,103 Caesium-2-1 INFO [crowd.manager.directory.FailedSynchronisationManagerImpl] Found 1 stalled synchronisations for directories [ [xxxxxx] ]. Rescheduling them to run again
      
      2024-07-12 15:15:00,026 Caesium-2-4 INFO [crowd.manager.directory.FailedSynchronisationManagerImpl] Found 1 stalled synchronisations for directories [ [xxxxx] ]. Rescheduling them to run again
       

      Other locks in the cwd_cluster_lock table might be affected in the same way.

      This is reproducible on Data Center: Yes

      Steps to Reproduce

      1. Start a Crowd cluster with two nodes and begin the synchronisation for the User Directory on one of the nodes.
      2. Bring down the node on which the User Directory synchronisation was started, leaving only one available node in the cluster.
      3. The inactive node still holds the lock for the directory synchronisation in the cwd_cluster_lock table, and the synchronisation remains in progress. This can be checked with the following queries:
      select * from cwd_cluster_lock where lock_name like '%com.atlassian.crowd.embedded.api.Directory:DIRECTORY_ID%'; -- Replace DIRECTORY_ID with the directory id

      select * from cwd_synchronisation_status where directory_id = 'xxxxx'; -- Replace xxxxx with the directory id
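
The check in step 3 can be sketched in Python as follows. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the real Crowd database, the table is reduced to the two columns named in this report (lock_name and node_id), and the function name find_orphaned_locks, the node IDs, and the directory ID 1001 are hypothetical. The list of live node IDs has to be supplied by the operator.

```python
import sqlite3


def find_orphaned_locks(conn, live_node_ids):
    """Return (lock_name, node_id) pairs whose holder is not a live node."""
    rows = conn.execute(
        "SELECT lock_name, node_id FROM cwd_cluster_lock "
        "WHERE node_id IS NOT NULL ORDER BY lock_name"
    ).fetchall()
    live = set(live_node_ids)
    return [(name, node) for name, node in rows if node not in live]


# Stand-in database: simplified schema, hypothetical values.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cwd_cluster_lock (lock_name TEXT PRIMARY KEY, node_id TEXT)"
)
conn.executemany(
    "INSERT INTO cwd_cluster_lock VALUES (?, ?)",
    [
        # Lock taken by the node that was later brought down.
        ("com.atlassian.crowd.embedded.api.Directory:1001", "node-a"),
        # Lock held by the node that is still running.
        ("example.other.lock", "node-b"),
    ],
)

orphaned = find_orphaned_locks(conn, live_node_ids=["node-b"])
print(orphaned)
```

With only node-b alive, the directory lock still assigned to node-a is the one reported.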

       

      Expected Results

      Once the node becomes inactive or its node ID changes, the directory synchronisation lock should be released and the User Directory synchronisation should continue.

      Actual Results

      The following messages are logged in the Crowd application log (atlassian-crowd.log) and the User Directory synchronisation is stuck:

      2024-07-12 15:10:00,103 Caesium-2-1 INFO [crowd.manager.directory.FailedSynchronisationManagerImpl] Found 1 stalled synchronisations for directories [ [xxxxxx] ]. Rescheduling them to run again 
      2024-07-12 15:15:00,026 Caesium-2-4 INFO [crowd.manager.directory.FailedSynchronisationManagerImpl] Found 1 stalled synchronisations for directories [ [xxxxx] ]. Rescheduling them to run again 

      Workaround

      Stop one of the nodes, update the node ID in the tables below so that it refers to a node that is currently live in the cluster, and then start the node again:

      update cwd_cluster_lock set node_id = '*******' where lock_name = 'com.atlassian.crowd.embedded.api.Directory:DIRECTORY_ID'; -- Replace DIRECTORY_ID with the directory id

      update cwd_synchronisation_status set node_id = '*******' where directory_id = 'xxxxxx'; -- Replace xxxxxx with the directory id
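
Because the workaround touches two tables, it may be safer to apply both updates in a single transaction so that they succeed or fail together. The sketch below shows that idea in Python. It is not Atlassian's procedure: SQLite stands in for the real database, the schemas are cut down to the columns used in the statements above, and reassign_sync_lock, the node IDs, and the directory ID are hypothetical names chosen for illustration.

```python
import sqlite3


def reassign_sync_lock(conn, directory_id, live_node_id):
    """Point the directory's lock and sync status at a live node, atomically."""
    lock_name = f"com.atlassian.crowd.embedded.api.Directory:{directory_id}"
    with conn:  # commits on success, rolls back if either update raises
        locks = conn.execute(
            "UPDATE cwd_cluster_lock SET node_id = ? WHERE lock_name = ?",
            (live_node_id, lock_name),
        ).rowcount
        statuses = conn.execute(
            "UPDATE cwd_synchronisation_status SET node_id = ? "
            "WHERE directory_id = ?",
            (live_node_id, directory_id),
        ).rowcount
    return locks, statuses


# Stand-in database: simplified schemas, hypothetical values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cwd_cluster_lock (lock_name TEXT, node_id TEXT)")
conn.execute(
    "CREATE TABLE cwd_synchronisation_status (directory_id TEXT, node_id TEXT)"
)
conn.execute(
    "INSERT INTO cwd_cluster_lock VALUES (?, ?)",
    ("com.atlassian.crowd.embedded.api.Directory:1001", "node-a"),
)
conn.execute(
    "INSERT INTO cwd_synchronisation_status VALUES (?, ?)", ("1001", "node-a")
)

updated = reassign_sync_lock(conn, directory_id="1001", live_node_id="node-b")
print(updated)  # number of rows changed in each table
```

Checking the returned row counts is a cheap way to confirm that the lock name and directory ID actually matched a row before the node is started again.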

            Assignee:
            Unassigned
            Reporter:
            Nitin Rastogi
            Votes:
            0
            Watchers:
            3

              Created:
              Updated: