- Type: Bug
- Resolution: Unresolved
- Priority: Low
- Affects Version/s: 5.3.2
- Component/s: Directories
- Severity: 3 - Minor
Issue Summary
When a node fails to release the lock on a directory synchronisation it started, the synchronisation process can continue running in the background. This can occur if the node holding the lock becomes non-operational or if its node ID changes.
As a result, the synchronisation remains active but can never complete, and the following messages appear in the application logs:
2024-07-12 15:10:00,103 Caesium-2-1 INFO [crowd.manager.directory.FailedSynchronisationManagerImpl] Found 1 stalled synchronisations for directories [ [xxxxxx] ]. Rescheduling them to run again
2024-07-12 15:15:00,026 Caesium-2-4 INFO [crowd.manager.directory.FailedSynchronisationManagerImpl] Found 1 stalled synchronisations for directories [ [xxxxx] ]. Rescheduling them to run again
Other locks in the cwd_cluster_lock table may also be affected by a similar situation.
This is reproducible on Data Center: Yes
Steps to Reproduce
- Start a Crowd cluster with two nodes and begin synchronisation of a User Directory on one of the nodes.
- Bring down the node on which the User Directory synchronisation was started, leaving only one available node in the cluster.
- The inactive node will still hold the lock for the directory synchronisation in the cwd_cluster_lock table, and the synchronisation will remain in progress:
-- Replace DIRECTORY_ID with the directory id
select * from cwd_cluster_lock where lock_name like '%com.atlassian.crowd.embedded.api.Directory:DIRECTORY_ID%';
select * from cwd_synchronisation_status where directory_id = 'xxxxx';
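The two lookups above can also be combined into a single result set, which makes a stale lock holder visible at a glance. This is a sketch only: it assumes the node_id columns referenced in the Workaround section, and DIRECTORY_ID is a placeholder for the actual directory id.

```sql
-- Sketch: show the node recorded as the lock holder next to the node
-- recorded for the synchronisation run. A node_id that no longer matches
-- a live cluster node indicates the stale lock described in this issue.
-- Replace DIRECTORY_ID with the directory id.
select l.node_id as lock_holder_node,
       s.node_id as sync_node,
       s.directory_id
from cwd_cluster_lock l,
     cwd_synchronisation_status s
where l.lock_name = 'com.atlassian.crowd.embedded.api.Directory:DIRECTORY_ID'
  and s.directory_id = 'DIRECTORY_ID';
```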
Expected Results
Once the node is inactive or if the node ID is changed, the directory synchronisation lock should be released and the User Directory synchronisation should continue.
Actual Results
The following message is logged in the Crowd application log (atlassian-crowd.log) and the User Directory synchronisation is stuck:
2024-07-12 15:10:00,103 Caesium-2-1 INFO [crowd.manager.directory.FailedSynchronisationManagerImpl] Found 1 stalled synchronisations for directories [ [xxxxxx] ]. Rescheduling them to run again
2024-07-12 15:15:00,026 Caesium-2-4 INFO [crowd.manager.directory.FailedSynchronisationManagerImpl] Found 1 stalled synchronisations for directories [ [xxxxx] ]. Rescheduling them to run again
Workaround
Stop one of the nodes, update the node ID in the tables below so that it reflects the current/live node in the cluster, and then start the node again:
-- Replace DIRECTORY_ID with the directory id
update cwd_cluster_lock set node_id = '*******' where lock_name = 'com.atlassian.crowd.embedded.api.Directory:DIRECTORY_ID';
update cwd_synchronisation_status set node_id = '*******' where directory_id = 'xxxxxx';
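As a sketch only, the two updates above can be wrapped in a single transaction so the lock table and the synchronisation status table are never left disagreeing if the second statement fails. NEW_NODE_ID and DIRECTORY_ID are placeholders for the live node's ID and the directory id; substitute the real values before running.

```sql
begin;
-- Point the stale lock at the current/live node.
update cwd_cluster_lock
   set node_id = 'NEW_NODE_ID'
 where lock_name = 'com.atlassian.crowd.embedded.api.Directory:DIRECTORY_ID';
-- Keep the synchronisation status row consistent with the lock.
update cwd_synchronisation_status
   set node_id = 'NEW_NODE_ID'
 where directory_id = 'DIRECTORY_ID';
commit;
```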