Project: Bamboo Data Center

[BAM-9154] Handle Elastic Agents that hang in "Pending" state

    • Type: Suggestion
    • Resolution: Fixed
    • 3.2 final, 3.2

      Since we started using Windows EC2 instances, I've noticed that some instances fail to start. I believe the root cause is on the EC2 side; however, Bamboo could
      handle this situation more gracefully, because otherwise customers pay for EC2 instances that they cannot use.

      Symptoms:

      • The elastic instance has been running for a long time, yet the elastic agent is still in the Pending state
      • You are unable to connect to the instance via Remote Desktop
      • Requesting the System Log in the EC2 console returns no output

      Recommendation: Shut down the elastic instance if the agent is stuck in the PENDING state for a long period of time (15 minutes?)
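The recommendation above amounts to a periodic check for instances whose agent has stayed in PENDING past a threshold. The following is a minimal sketch of that check only, not Bamboo's implementation: the `ElasticInstance` record, the `stuck_instances` function, and the 15-minute default (taken from the tentative figure above) are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical threshold, based on the tentative "15 minutes?" suggestion.
PENDING_TIMEOUT = timedelta(minutes=15)


@dataclass(frozen=True)
class ElasticInstance:
    """Hypothetical record for illustration; not Bamboo's real data model."""
    instance_id: str
    started_at: datetime
    agent_state: str  # e.g. "PENDING" or "ONLINE"


def stuck_instances(instances, now, timeout=PENDING_TIMEOUT):
    """Return the instances whose agent has been PENDING longer than `timeout`."""
    return [
        inst for inst in instances
        if inst.agent_state == "PENDING" and now - inst.started_at > timeout
    ]
```

A caller would pass the returned instances to whatever shutdown mechanism is available; that part is deliberately left out because it depends on the EC2 client in use.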

      Attachments:

        1. 1Pending.jpg (24 kB)
        2. 2InstanceInfoInBamboo.jpg (58 kB)
        3. 3awsDash.jpg (29 kB)
        4. 4SystemLog.jpg (32 kB)


            Yesterday we placed 4 spot instance requests for the same AMI (based on a modified 2011.2 S3 32-bit Amazon Linux image). 3 of these instances started correctly; 1 did not. We are unable to connect to that instance or fetch its system logs. Are you able to investigate why that happened? The instance is currently active.
            
            The timeline (UTC) is as follows:
            2011-06-30 10:59:30,421 Placed spot request sir-a4764a14
            2011-06-30 11:04:02,998 Spot instance request sir-a4764a14 is now active as instance i-f982ba97 (3 other spot requests too)
            2011-06-30 11:05:31,059 EC2 instance i-f982ba97 is now running at ec2-50-19-59-8.compute-1.amazonaws.com
            
            Hi,
            
            Thank you for contacting AWS Premium Support.
            
            I have taken a look at i-f982ba97 and do see that it launched, but I am not seeing any console output.  Looking at the underlying hardware, at first sight I am not seeing any issues with CPU, Network, etc. However, when I try to open a socket connection to your instance, as well as to a few others running on the same hardware, I am running into some issues.
            
            I will continue to investigate this as a potential hardware issue, however as this instance is not EBS-backed, you will need to terminate this instance and launch another.
            
            Is there anything else that I can help with at this time?
            
            Best regards,
            
            Travis G.
            Amazon Web Services
            =======================================
            

            Przemek Bruski added the comment above.

            Confirmed as a hardware error in AWS datacentre.

            Przemek Bruski added the comment above.

            OK, the problem is different from what I expected. I think it may be an error on the AWS side and I will file a bug for it (not sure if they will be able to investigate though).
            Nevertheless, having this improvement in place would be a good idea. The problem affects both Windows and Linux instances; an attempt to fetch the console log from a Linux instance is shown below:

            kaper:~$ ec2-get-console-output i-f982ba97
            i-f982ba97
            2011-06-30T11:08:29+0000
            

            The solution is to terminate the instance if the tunnel to the agent cannot be established within a reasonable amount of time. Currently, it retries forever.
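The change described above replaces an unbounded retry loop with one that gives up at a deadline. A minimal sketch of that pattern follows, under stated assumptions: `connect_or_terminate`, `open_tunnel`, and `terminate` are hypothetical names, the 900-second default is only illustrative, and none of this reflects Bamboo's actual code.

```python
import time


def connect_or_terminate(open_tunnel, terminate, timeout_s=900.0,
                         retry_delay_s=10.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Retry open_tunnel() until it succeeds or timeout_s has elapsed.

    open_tunnel: hypothetical callable returning True once the tunnel is up.
    terminate:   hypothetical callable that shuts the instance down.
    Returns True on success; otherwise calls terminate() once and returns False.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        try:
            if open_tunnel():
                return True
        except OSError:
            pass  # treat a connection error as one more failed attempt
        sleep(retry_delay_s)
    # The deadline passed without a tunnel: stop paying for the instance.
    terminate()
    return False
```

The clock and sleep functions are parameters so the deadline logic can be exercised without real waiting.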

            Przemek Bruski added the comment above.

            Reopened, as Bryce saw this behaviour on tardigrade. Will confirm once we see M3 go out on JBAC.

            Peter Leschev added the comment above.

            This should be handled by the agent watchdog. It's the only reliable way of shutting down instances anyway.
            It does not work pre-M3, though.

            Przemek Bruski added the comment above.

              pbruski Przemek Bruski
              pleschev Peter Leschev
              Votes: 0
              Watchers: 2