• Bug
    • Resolution: Unsolved Mysteries
    • Medium
    • 3.2 M0, 3.2
    • 3.0
    • Elastic Bamboo
    • None

      Hi, I have a plan that keeps putting EC2 agents offline even when the machine is fine. I suspect activity peaks or a period without output (not sure how you decide whether an agent is alive or not).

      2011-02-22 16:56:13,013 WARN [QuartzScheduler_Worker-1] [RemoteAgentManagerImpl] Detected that remote agent 'Elastic Agent on i-17a8f07b' has been inactive since Tue Feb 22 16:52:24 CST 2011
      2011-02-22 16:56:13,013 WARN [QuartzScheduler_Worker-1] [RemoteAgentManagerImpl] Marking remote agent 'Elastic Agent on i-17a8f07b' as unresponsive
      

      As I mentioned, the machine looks fine. I SSHed into it after it was marked as unresponsive and got this info:

      top - 18:38:33 up  1:22,  2 users,  load average: 0.00, 0.00, 0.84
      Tasks:  59 total,   1 running,  58 sleeping,   0 stopped,   0 zombie
      Cpu(s):  0.0%us,  0.3%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.3%st
      Mem:   1788724k total,   293776k used,  1494948k free,   105368k buffers
      Swap:   917496k total,    86584k used,   830912k free,    24792k cached
      
        PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                                                                                        
       1602 bamboo    25   0  910m 100m 3604 S    0  5.8   4:20.68 java                                                                                                                                                                                                            
          1 root      15   0  2132   88   56 S    0  0.0   0:04.08 init                                                                                                                                                                                                            
          2 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/0                                                                                                                                                                                                     
          3 root      34  19     0    0    0 S    0  0.0   0:00.02 ksoftirqd/0         
      
      ps faux
      root      1571  0.0  0.0   2664     8 ?        S    17:18   0:00 su -c bamboo-elastic-agent - bamboo
      bamboo    1575  0.0  0.0   2592     8 ?        Ss   17:18   0:00  \_ /bin/bash /opt/bamboo-elastic-agent/bin/bamboo-elastic-agent
      bamboo    1602  5.3  5.7 932276 103092 ?       Sl   17:18   4:20      \_ java -server -Xms32m -Xmx512m -XX:MaxPermSize=256m -cp /opt/bamboo-elastic-agent/bin/../lib:/opt/bamboo-elastic-agent/bin/../lib/spring-beans-2.0.7.jar:/opt/bamboo-elastic-agent/bin/../lib/jcl-ove
      bamboo    1605  0.0  0.0   1756    64 ?        S    17:18   0:00      \_ tee -a /home/bamboo/bamboo-elastic-agent.out
      
      

      Here is the plan: https://bamboo.extranet.atlassian.com/browse/CONFFUNC-PARA

            [BAM-8093] Elastic agents go offline prematurely

            Adrian Deccico [Atlassian] added a comment -

            Hi, I can't reproduce this problem anymore. Feel free to close this issue.

            Przemek Bruski added a comment -

            "From now on I will increase the heartbeat of BEAC; any good reason for not doing that?"

            I assume you are talking about the heartbeat timeout, not the heartbeat interval. Yes, you should be able to use the default value of 600 seconds.
            It's only a workaround, though. Let's get back to 200 seconds once you can set up this: https://extranet.atlassian.com/display/BAMBOO/Elastic+agent+log+capture (it will be available with 3.1 M1). I'll also try to set up this plan on Tardigrade so that we don't have to experiment on BEAC.

            Adrian Deccico [Atlassian] added a comment -

            "Can you tell what the CPU usage of the machine is during the time that Bamboo server cannot communicate with the agent?"

            Unfortunately I don't have that metric.

            "Does this happen whenever the same actual test is executed? Can you tell this from looking at the test logs and figure out which tests were running during that time?"

            I checked the logs and the problem happens randomly (different tests) but always in the same plan.

            From now on I will increase the heartbeat of BEAC; any good reason for not doing that?

            Przemek Bruski added a comment -

            Wow, 6 builds in a row - nasty... Did you have a look at the agent log file?

            "not sure how you decide whether an agent is alive or not"

            Agents are considered to be alive if they have sent a heartbeat message within the last 600 seconds (BEAC has this configured at 200 seconds, though).
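The timeout rule described above can be checked against the two timestamps in the log lines from the issue description. What follows is only a minimal sketch in Python. It assumes that both timestamps are in the same time zone and that the "inactive since" time is the last heartbeat the server received. The helper name `is_unresponsive` is hypothetical and is not part of Bamboo.

```python
from datetime import datetime

# Timeouts quoted in the discussion, in seconds.
DEFAULT_TIMEOUT_S = 600
BEAC_TIMEOUT_S = 200


def is_unresponsive(last_heartbeat: datetime, now: datetime, timeout_s: int) -> bool:
    """Hypothetical helper: True if no heartbeat arrived within timeout_s seconds."""
    return (now - last_heartbeat).total_seconds() > timeout_s


# Timestamps taken from the log lines in the issue description
# (assumed to share one time zone; milliseconds dropped).
last_heartbeat = datetime(2011, 2, 22, 16, 52, 24)
detected_at = datetime(2011, 2, 22, 16, 56, 13)

gap_s = (detected_at - last_heartbeat).total_seconds()
print(gap_s)  # 229.0
print(is_unresponsive(last_heartbeat, detected_at, BEAC_TIMEOUT_S))     # True
print(is_unresponsive(last_heartbeat, detected_at, DEFAULT_TIMEOUT_S))  # False
```

Under these assumptions the gap is 229 seconds, which exceeds the 200-second BEAC setting but not the 600-second default. That is consistent with raising the timeout as a workaround.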

            AntonA added a comment -

            Adrian,

            We will look into this.

            Can you tell what the CPU usage of the machine is during the time that Bamboo server cannot communicate with the agent? So for the above build, that would be between Tue Feb 22 16:52:24 CST 2011 and 2011-02-22 16:56:13,013?

            Also, does this happen whenever the same actual test is executed? Can you tell this from looking at the test logs and figure out which tests were running during that time?

            Cheers,
            Anton
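One way to have the CPU metric available the next time this happens is to log it periodically on the agent. What follows is only a minimal sketch in Python, assuming a Linux agent where `/proc/stat` is readable. The function names and the 10-second interval are illustrative choices, not something Bamboo provides.

```python
import time
from datetime import datetime
from typing import Tuple


def parse_cpu_line(line: str) -> Tuple[int, int]:
    """Return (idle_ticks, total_ticks) from the aggregate 'cpu' line of /proc/stat."""
    # Use at most the first eight counters: user, nice, system, idle,
    # iowait, irq, softirq, steal. Guest time is already included in user.
    fields = [int(x) for x in line.split()[1:9]]
    idle = fields[3] + (fields[4] if len(fields) > 4 else 0)  # idle + iowait
    return idle, sum(fields)


def busy_percent(prev_line: str, cur_line: str) -> float:
    """CPU busy percentage between two samples of the aggregate 'cpu' line."""
    prev_idle, prev_total = parse_cpu_line(prev_line)
    cur_idle, cur_total = parse_cpu_line(cur_line)
    total_delta = cur_total - prev_total
    if total_delta <= 0:
        return 0.0
    return 100.0 * (1 - (cur_idle - prev_idle) / total_delta)


def read_cpu_line() -> str:
    # The first line of /proc/stat is the aggregate over all CPUs.
    with open("/proc/stat") as f:
        return f.readline()


def sample_forever(interval_s: float = 10.0) -> None:
    """Print a timestamped CPU busy percentage every interval_s seconds."""
    prev = read_cpu_line()
    while True:
        time.sleep(interval_s)
        cur = read_cpu_line()
        print(f"{datetime.now().isoformat()} cpu_busy={busy_percent(prev, cur):.1f}%")
        prev = cur
```

Calling `sample_forever()` on the agent and redirecting its output to a file would give timestamped samples that can be lined up with the "inactive since" window reported by the server.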


              pbruski Przemek Bruski
              adeccico Adrian Deccico [Atlassian]