Jira Data Center / JRASERVER-42916

Stale node ids should automatically be removed in Jira Data Center

      Atlassian Update – 16 June 2020

      Hi everyone,

      Thank you for your votes and comments on this issue. We would like to inform you that this suggestion will be addressed in the upcoming Jira Data Center version 8.10.0 release.

      We’ve decided to provide a more automated way of handling stale (No heartbeat) nodes in Jira Data Center. Before the changes, if a node lost connection to the cluster for 5 minutes, its state changed from “Active” to “No heartbeat”. If such a node was not moved to the “Offline” state, it might cause performance degradation.

      We’ve automated this process and the solution is as follows:

      • If a node is in the “No heartbeat” state for longer than 2 days, it will be automatically moved to the “Offline” state. Admins will be informed about this via a warning in the atlassian-jira.log file and will see this state on the Clustering page. During this period you will be able to check the node or restart it.
      • If a node is in the “Offline” state for longer than 2 days, it will be automatically removed from the cluster. You will also be informed about this action through info-level logs in your atlassian-jira.log file.

      Additionally, based on the feedback we received in the comments below, in Jira Data Center version 8.11.0 we will add the ability to adjust the 2-day stale node retention period. You can find more details about this suggestion under this thread.

      Moreover, since Jira Data Center 8.6 we have been bringing more visibility into the nodes in your cluster by introducing the Clustering page in the admin panel. In the newly released Jira Data Center version 8.9 we extended this page with additional information about node statuses (Active, No heartbeat, Offline) and the Jira DC application status (maintenance, error, running, starting), so that stale nodes can be identified more easily.

      Lastly, the changes described above are integrated with the Advanced audit log functionality available in Jira Data Center since version 8.8. Any automatic actions will be logged to give admins more visibility into what is happening on their instance. For more details please go here.

      Thank you for voting and commenting on this suggestion,
      Grażyna Kaszkur
      Product manager, Jira Server and Data Center

    • We collect Jira feedback from various sources, and we evaluate what we've collected when planning our product roadmap. To understand how this piece of feedback will be reviewed, see our Implementation of New Features Policy.

      NOTE: This suggestion is for JIRA Server. Using JIRA Cloud? See the corresponding suggestion.

      Problem Definition

      After changing the node id in the cluster.properties file, both the old and new ids will appear in the Cluster Nodes section of System information. The problem is worse on AWS, since it creates many new nodes and never reuses them.

      Suggested Solution

      We should find a way to clear out any old ids, without removing any entries that might be from a temporarily offline node.

      Note

      Having old nodes in the system (table) may cause other problems; see the related issues:

      Workaround

      • In recent versions of Jira we introduced a new REST API to manage the cluster state, which mitigates the problem. See JRASERVER-69033 and the sketch after the manual steps below.
      • Clean up old data manually:
      1. Check tables and find all rows related to old nodes:
        select * from clusternode;
        select * from clusternodeheartbeat;
        
      2. Delete the related records:
        delete from clusternode where node_id = '<node_id>';
        delete from clusternodeheartbeat where node_id = '<node_id>';
        
      3. Clean old replication records:
        -- check if cleanup is necessary
        select count(id) from replicatedindexoperation where node_id = '<node_id>';
        -- delete
        delete from replicatedindexoperation where node_id = '<node_id>';
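
      Alternatively, on Jira 8.1 or later the cluster REST API mentioned above can replace the manual SQL. A minimal sketch, assuming admin credentials in hypothetical JIRA_USER/JIRA_PASS environment variables and the base URL in JIRA_BASE_URL; the endpoints are the ones referenced in JRASERVER-69033, and a similar script is shared in the comments below:

        #!/bin/bash
        # Minimal sketch: remove stale cluster nodes via the REST API instead of direct SQL.
        # Assumes JIRA_BASE_URL, JIRA_USER and JIRA_PASS are set and the user has admin rights.
        set -euo pipefail

        # List nodes that are no longer alive or are already OFFLINE.
        stale_nodes=$(curl -s -u "${JIRA_USER}:${JIRA_PASS}" \
          "${JIRA_BASE_URL}/rest/api/2/cluster/nodes" \
          | jq -r '.[] | select((.alive == false) or (.state == "OFFLINE")) | .nodeId')

        # Remove each stale node from the cluster state.
        for node_id in ${stale_nodes}; do
          echo "Removing stale node: ${node_id}"
          curl -s -u "${JIRA_USER}:${JIRA_PASS}" -X DELETE \
            "${JIRA_BASE_URL}/rest/api/2/cluster/node/${node_id}"
        done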
        

            [JRASERVER-42916] Stale node ids should automatically be removed in Jira Data Center

            Yevgen Lasman added a comment -

            It would be great if you can also remove inactive nodes from the support ZIP generation page.


            Grazyna Kaszkur added a comment -

            Hi ebukoski1, thank you very much for your feedback.

            Regarding the automatic clean-up after 2 days, we plan to add the ability to adjust this value in the system properties. As for an API request that would allow you to clean up stale nodes, you can use the existing methods described in the workaround section above and in this separate suggestion: https://jira.atlassian.com/browse/JRASERVER-69033.

            Please let us know if you have any feedback regarding those APIs (released in 8.1.0).

            Ed Bukoski added a comment -

            I just read the May 22, 2020 update – I like the automatic cleanup aspect of this but two days is a long time when we could tell Jira immediately in our deployment scripts "Hey stop trying to talk to this node, it is gone we terminated it as part of this deployment!" 

            I'm also not excited about 2 days of warning/error messages in the logs and health screens as the active Jira nodes keep trying to communicate to dead ones.  

            So we would still like an API that we can use to manage this, is that being planned as well?


            Jackie Chen added a comment -

            We use an AWS Auto Scaling group for the Jira cluster, and I am thinking of using lifecycle hooks to mitigate this issue:

            For Launching

            Send a cloud event to a Lambda function to remove the status=offline or alive=false nodes from the database. This would be very helpful when restoring the DB from prod to non-prod.

            For Terminating

            Send a cloud event to a Lambda function to stop the Jira service first, then remove the node from the database. This would do its best to gracefully shut down the Jira service, then clear this offline node from the DB.
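
            A minimal sketch of the termination-side cleanup described above, assuming the cluster REST API from JRASERVER-69033 and hypothetical locations/credentials in JIRA_HOME, JIRA_INSTALL, JIRA_BASE_URL, JIRA_USER and JIRA_PASS:

              #!/bin/bash
              # Sketch of a terminate-lifecycle cleanup: stop Jira on this node, then
              # remove the node from the cluster state through the load-balanced base URL.
              set -euo pipefail

              # Read this node's id from cluster.properties in the local home directory.
              node_id=$(grep '^jira.node.id' "${JIRA_HOME}/cluster.properties" | cut -d= -f2 | tr -d '[:space:]')

              # Stop Jira gracefully so the node stops heartbeating.
              "${JIRA_INSTALL}/bin/stop-jira.sh"

              # Delete the node from the cluster state; depending on the Jira version this may
              # only succeed once the node is reported as offline / not alive.
              curl -s -u "${JIRA_USER}:${JIRA_PASS}" -X DELETE \
                "${JIRA_BASE_URL}/rest/api/2/cluster/node/${node_id}"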

            Tomas Karas added a comment -

            Exactly, Fabian. Especially when you copy a production system to development: you change the serverID to break appLinks, but then the dev cluster connects to the prod cluster, completely ignoring that:

            • the network is not the same
            • the serverID is different
            • the licenses are different

            But hey, for that the Atlassian KB contains a mention along the lines of: delete rows from a table or two when creating the dev instance...

            And the thing you must love the most is that the attachment folder path is hardcoded in the XML backup in Jira DC, regardless of the path setting.

            Then you start deleting projects on the dev/migration cluster to create a smaller XML backup for a smaller instance, or for confidentiality reasons...

            Fabian Fingerle added a comment - This is also painful during a system copy from production to non-production environments.

            KWRI IT added a comment -

            This one is painful in a Kubernetes deployment of Jira Data Center. We're logging into PostgreSQL and cleaning up the clusternode table a lot.

            Zaid Qureshi added a comment - yyyyt

            Jason Potkanski added a comment -

            In newer Jira versions (8.1+) they added an experimental REST API (JRASERVER-69033). Here is a snippet of a script that will clean most dead nodes away. Enjoy.

            #!/bin/bash
            # List all cluster nodes, select the dead ones (not alive or already OFFLINE),
            # and delete each of them via the experimental cluster REST API.
            nodelist=$(curl -s --user "${USERNAME}:${PASSWORD}" --url "${JIRASITE}/rest/api/2/cluster/nodes" \
              | jq -r '.[] | select((.alive == false) or (.state == "OFFLINE")) | .nodeId')
            for i in ${nodelist}
            do
               curl -s -X DELETE --user "${USERNAME}:${PASSWORD}" --url "${JIRASITE}/rest/api/2/cluster/node/${i}"
            done


            Cecilia W Jägerbrink added a comment - Please update us on this one.

            Yevgen Lasman added a comment - The six months mentioned in "Current Status" have passed; any update on this one?

            Yevgen Lasman added a comment - We've got the application log spammed with cache replication error messages related to inaccessible old nodes. Can't wait to see automatic node cleanup implemented!

            Michael Bulger added a comment -

            The inability of the tool to remove old nodes creates problems for us when trying to restore an XML backup from our prod instance into our non-prod instance. Several of our production nodes remain 'Active' in the clusternode table, and as a result the non-prod nodes attempt to replicate their caches to the production nodes.

            Grazyna Kaszkur added a comment -

            Hi everyone,

            Thank you for your votes and thoughts on this issue. We fully understand that many of you are dependent on this functionality.

            After careful consideration, we've decided to prioritize this suggestion on the Jira Server roadmap. We hope to start development after our current projects are completed.

            Expect to hear an update on our progress within the next 6 months. 

            To learn more about how your suggestions are reviewed, see our updated workflow for server feature suggestions.

            Kind regards,

            Jira Server Product Management


            Tony Iskander added a comment -

            Hi,

            The issue was created more than 3 years ago and since then multiple people have spotted different issues related to Jira's inability to handle nodes going away.
            I don't know if it's just me, but in my opinion it is unacceptable that this is not being addressed, as it breaks the core principle of the DC product: high availability and proper clustering.

            There is one more scenario that I've not seen described here that happened to me. When you do a foreground full re-index of Jira and you lose the node that is doing the indexing, the process is locked, never completes, and can't be cancelled. The only solution that seems to work is to remove the entries in the db, stop/drop all nodes, and start again. It is just crazy.

            I believe this product would benefit from Atlassian not trying to build a clustered FTS solution of their own and, for example, allowing Jira DC to be configured with external software like Elasticsearch. I have initiated a discussion via a different support request and would encourage people who may be interested to jump on it and +1 it:
            https://jira.atlassian.com/browse/JRASERVER-68048

            This would circumvent the issue reported here.

            Moreover, I can already imagine full re-indexing happening by creating a new separate index and then, once completed, just flipping the Elasticsearch alias. In the meantime Jira would continue to use the old index, so it wouldn't lead to a lockdown.

            Elasticsearch is also available from AWS as a managed service, so it seems like a perfect fit for their cloud offering, which is currently broken. The Jira DC offering on the AWS Marketplace is just a mistake in its current form, imho.

            Thanks

            Jesse Rehmer [Contegix] added a comment -

            My index replication falls behind when I have a lot of stale nodes, because the nodes attempt to query nodes that are no longer online and fail. I'm not sure what method is used for determining which node a particular node requests indexes from, but in our case, with a 3-node cluster having 12 abandoned nodes, all three nodes report that the other two are behind. When I look through the logs of all of the nodes there are tons of errors about connecting to abandoned nodes for index replication.

            Jack [AppFox] added a comment - Just to weigh in: I have experienced problems before from the way Zephyr uses these tables to coordinate its own indexing.

            Matt Doar added a comment -

            Any reason for suspecting that the stale node ids cause index corruptions? Seems a bit unlikely given the way that Jira Data Center is constructed.


            Dorota Goffin added a comment -

            We have a 2-node Jira DC instance running in AWS and had to swap out nodes in the past due to performance issues. As a result, our list of historical nodes is quite long. We suspect this has an impact on the Jira and Zephyr indexes, resulting in index corruption happening occasionally. This is unacceptable for a high-availability application and needs to be looked at.

            Mansi Patel added a comment - We are in the same situation as other customers who are deploying in AWS, and as we are in the testing phase, the Jira DC option doesn't look enterprise-ready without a way to fix this issue.

            S Stack added a comment - edited

            Indirectly related: Restoring production data to a test DC instance causes old nodes to appear in the "Cluster Nodes" section of System Information. As a workaround, we're deleting rows for the old node ids from the clusternode table.

            Kevin Terminella added a comment - We are deployed in AWS and are regularly turning over our servers, so this causes a lot of inactive nodes to be displayed in JIRA.

              ddudziak Stasiu
              ayakovlev@atlassian.com Andriy Yakovlev [Atlassian]
              Votes: 168
              Watchers: 159