Jira Data Center / JRASERVER-42916

Stale node ids should automatically be removed in Jira Data Center

      Atlassian Update – 16 June 2020

      Hi everyone,

      Thank you for your votes and comments on this issue. We would like to inform you that this suggestion will be addressed in the upcoming Jira Data Center 8.10.0 release.

      We’ve decided to provide a more automated way of handling stale (No heartbeat) nodes in Jira Data Center. Before these changes, if a node lost connection to the cluster for 5 minutes, its state changed from “Active” to “No heartbeat”. If such a node was never moved to the “Offline” state, it could cause performance degradation.

      We’ve automated this process and the solution is as follows:

      • If a node is in the “No heartbeat” state for longer than 2 days, it will be automatically moved to the “Offline” state. Admins will be informed about this via a warning in the atlassian-jira.log file and will see this state on the Clustering page. During this period you can still check the node or restart it.
      • If a node is in the “Offline” state for longer than 2 days, it will be automatically removed from the cluster. You will also be informed about this action through info-level entries in your atlassian-jira.log file.

      Additionally, based on the feedback we received in the comments below, Jira Data Center 8.11.0 will add the ability to adjust the 2-day stale node retention period. You can find more details about this suggestion in this thread.
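
      As an illustration of how such a knob is usually exposed (not the final shipped mechanism): Jira reads JVM system properties set in setenv.sh, so adjusting the retention period would look roughly like this. The property name below is a placeholder; check the 8.11 release notes for the real one.

        # In <jira-install>/bin/setenv.sh. JVM_SUPPORT_RECOMMENDED_ARGS is Jira's
        # standard hook for extra JVM flags; the -D property name is hypothetical.
        JVM_SUPPORT_RECOMMENDED_ARGS="${JVM_SUPPORT_RECOMMENDED_ARGS} -Djira.cluster.stale.node.retention.period.hours=24"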

      Moreover, since Jira Data Center 8.6 we have been bringing more visibility into the nodes in your cluster by introducing the Clustering page in the admin panel. In the newly released Jira Data Center 8.9 we extended this page with additional information about node statuses (Active, No heartbeat, Offline) and the Jira DC application status (maintenance, error, running, starting), to make stale nodes easier to identify.

      Lastly, the changes described above are integrated with the Advanced audit log functionality available in Jira Data Center since version 8.8. Any automatic actions will be logged to give admins more visibility into what is happening on their instance. For more details, please go here.

      Thank you for voting and commenting on this suggestion,
      Grażyna Kaszkur
      Product manager, Jira Server and Data Center

    • We collect Jira feedback from various sources, and we evaluate what we've collected when planning our product roadmap. To understand how this piece of feedback will be reviewed, see our Implementation of New Features Policy.

      NOTE: This suggestion is for JIRA Server. Using JIRA Cloud? See the corresponding suggestion.

      Problem Definition

      After changing the node id in the cluster.properties file, both the old and the new id will appear in the Cluster Nodes section of System information. The problem is worse on AWS, since autoscaling will create many new nodes and never reuse the old ids.

      Suggested Solution

      We should find a way to clear out any old ids without removing entries that might belong to a temporarily offline node.

      Note

      Having old nodes in the system (the clusternode table) may cause other problems; see related:

      Workaround

      • In a recent version of Jira (8.1+) we introduced a new REST API to manage the cluster state, which mitigates the problem; see JRASERVER-69033 and the usage sketch after these steps.
      • Clean up old data manually:
      1. Check tables and find all rows related to old nodes:
        select * from clusternode;
        select * from clusternodeheartbeat;
        
      2. Delete the related records:
        delete from clusternode where node_id = '<node_id>';
        delete from clusternodeheartbeat where node_id = '<node_id>';
        
      3. Clean up old replication records:
        -- check whether cleanup is necessary
        select count(id) from replicatedindexoperation where node_id = '<node_id>';
        -- delete
        delete from replicatedindexoperation where node_id = '<node_id>';
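      As a sketch of the REST-based alternative from the first bullet (assumes Jira 8.1+ and an admin account; JIRASITE, USERNAME and PASSWORD are placeholders for your own instance):

        # List every node the cluster currently knows about:
        curl -s --user "${USERNAME}:${PASSWORD}" --url "${JIRASITE}/rest/api/2/cluster/nodes"
        # Remove a stale node by id (the same <node_id> as in the SQL above):
        curl -s -X DELETE --user "${USERNAME}:${PASSWORD}" --url "${JIRASITE}/rest/api/2/cluster/node/<node_id>"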


            Yevgen Lasman added a comment -

            It would be great if you can also remove inactive nodes from the support ZIP generation page.

            Grazyna Kaszkur added a comment -

            Atlassian Update – 16 June 2020 (the full text of this update appears at the top of this issue).

            Grazyna Kaszkur added a comment -

            Hi ebukoski1, thank you very much for your feedback.

            Regarding the automatic clean-up after 2 days, we plan to add the ability to adjust this value via a system property. As for an API request that would let you clean up stale nodes, you can use the existing methods described in the workaround section above and in this separate suggestion: https://jira.atlassian.com/browse/JRASERVER-69033.

            Please let us know if you have any feedback regarding those APIs (released in 8.1.0).

            Ed Bukoski added a comment -

            I just read the May 22, 2020 update – I like the automatic cleanup aspect of this, but two days is a long time when we could tell Jira immediately in our deployment scripts: "Hey, stop trying to talk to this node, it is gone; we terminated it as part of this deployment!"

            I'm also not excited about 2 days of warning/error messages in the logs and health screens as the active Jira nodes keep trying to communicate with dead ones.

            So we would still like an API that we can use to manage this; is that being planned as well?

            Jackie Chen added a comment -

            We use an AWS auto scaling group for the Jira cluster, and I am thinking of using lifecycle hooks to mitigate this issue:

            For Launching

            Send a cloud event to a Lambda function to remove the status=offline or alive=false nodes in the database. This would be very helpful when restoring the DB from prod to non-prod.

            For Terminating

            Send a cloud event to a Lambda function to stop the Jira service first, then remove the node from the database. This would try its best to gracefully shut down the Jira service, then clear this offline node from the DB. (A rough sketch of this follows below.)
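            A minimal sketch of what the terminating hook could run, assuming Jira 8.1+ with the cluster REST API; JIRA_HOME, JIRA_INSTALL, USERNAME, PASSWORD and JIRASITE are placeholders for your environment, not Atlassian-provided names:

                # Read this node's id from cluster.properties (jira.node.id is its key):
                NODE_ID=$(grep '^jira.node.id' "${JIRA_HOME}/cluster.properties" | cut -d= -f2)
                # 1. Stop Jira gracefully on the node being retired:
                "${JIRA_INSTALL}/bin/stop-jira.sh"
                # 2. Ask a surviving node to forget this node (depending on version,
                #    it may first need to be moved to the Offline state):
                curl -s -X DELETE --user "${USERNAME}:${PASSWORD}" --url "${JIRASITE}/rest/api/2/cluster/node/${NODE_ID}"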

            Tomas Karas added a comment -

            Exactly, Fabian. Especially when you copy a production system to development: you change the serverID to break appLinks, but then the dev cluster connects to the prod cluster, completely ignoring that the

            • network is not the same
            • serverID is different
            • licenses are different

            But hey, for that the Atlassian KB contains a mention along the lines of: delete rows from a table or two when creating the dev instance ...

            And the thing you must love the most is that the attachment folder path is hardcoded in the XML backup in Jira DC, regardless of the path setting.

            Then you start deleting projects on the dev/migration cluster to create a smaller XML backup for a smaller instance, or for confidentiality reasons...

            Fabian Fingerle added a comment -

            This is also painful during system copy from production to non-production environments.

            KWRI IT added a comment -

            This one is painful in a Kubernetes deployment of Jira Data Center. We're logging into PostgreSQL and cleaning up the clusternode table a lot.



            Jason Potkanski added a comment -

            In newer Jira versions (8.1+) they added an experimental REST API ( JRASERVER-69033 ). Here is a snippet of a script that will clean most dead nodes away. Enjoy.

            # Collect the ids of dead or offline nodes, then delete each one.
            # Note: .alive is a JSON boolean, so compare against false, not "false".
            nodelist=`curl -s --user ${USERNAME}:${PASSWORD} --url "${JIRASITE}/rest/api/2/cluster/nodes" | jq -r '.[] | select((.alive==false) or (.state=="OFFLINE")) | .nodeId' | tr '\n' ' '`
            nodearray=($nodelist)
            for i in "${nodearray[@]}"
            do
               curl -s -X "DELETE" --user ${USERNAME}:${PASSWORD} --url "${JIRASITE}/rest/api/2/cluster/node/${i}"
            done
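            A usage note on the snippet above: this API is experimental, and depending on your Jira version the DELETE call may only succeed for nodes already in the Offline state, so a node still reporting a heartbeat may need to be taken offline first; check the cluster REST API documentation for your version.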

              Assignee: ddudziak Stasiu
              Reporter: ayakovlev@atlassian.com Andriy Yakovlev [Atlassian]
              Votes: 168
              Watchers: 159