Uploaded image for project: 'Bamboo Data Center'
  1. Bamboo Data Center
  2. BAM-3721

Elastic agent does not exit on communication error

    • Our product teams collect and evaluate feedback from a number of different sources. To learn more about how we use customer feedback in the planning process, check out our new feature policy.

      When Bamboo server does not respond to heartbeat elastic agent does not finish - it tries to reach server and all subsequent heartbeat coall are cancelled as first on is in progress.

            [BAM-3721] Elastic agent does not exit on communication error

            This should be fixed by BAM-8731 .

            Przemek Bruski added a comment - This should be fixed by BAM-8731 .

            What's more - on that EC2 instance - after the incident - or rather during the weird state (according to Bamboo server), but after the build is finished (according to elastic agent's logs) I can successfully telnet locally to both ports mentioned in agent's logs:

            • telnet 127.0.0.1 4527
            • telnet 127.0.0.1 46593

            Krystian Nowak added a comment - What's more - on that EC2 instance - after the incident - or rather during the weird state (according to Bamboo server), but after the build is finished (according to elastic agent's logs) I can successfully telnet locally to both ports mentioned in agent's logs: telnet 127.0.0.1 4527 telnet 127.0.0.1 46593

            In another instance of such incident in elastic agent's logs I've got (apart from the messages given in previous comment) this one connected with the inability to send artefacts back to Bamboo server:

            2009-06-16 22:10:19,659 ERROR [BAM::Elastic Agent on i-a9bbe5c0::Agent] [HttpMessageSender] Exception occurred when posting message com.atlassian.bamboo.v2.build.agent.messages.PublishArtifacts@c71d
            5a. Connection has been shutdown: javax.net.ssl.SSLException: java.net.SocketException: Broken pipe
            javax.net.ssl.SSLException: Connection has been shutdown: javax.net.ssl.SSLException: java.net.SocketException: Broken pipe
                    at com.sun.net.ssl.internal.ssl.SSLSocketImpl.checkEOF(SSLSocketImpl.java:1255)
                    at com.sun.net.ssl.internal.ssl.SSLSocketImpl.checkWrite(SSLSocketImpl.java:1267)
                    at com.sun.net.ssl.internal.ssl.AppOutputStream.write(AppOutputStream.java:43)
                    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
                    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
                    at org.apache.commons.httpclient.ChunkedOutputStream.flushCacheWithAppend(ChunkedOutputStream.java:119)
                    at org.apache.commons.httpclient.ChunkedOutputStream.write(ChunkedOutputStream.java:178)
                    at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1838)
                    at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1747)
                    at java.io.ObjectOutputStream.writeFatalException(ObjectOutputStream.java:1541)
                    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:329)
                    at com.atlassian.bamboo.v2.build.agent.remote.sender.HttpMessageSender$1.writeRequest(HttpMessageSender.java:44)
                    at org.apache.commons.httpclient.methods.EntityEnclosingMethod.writeRequestBody(EntityEnclosingMethod.java:495)
                    at org.apache.commons.httpclient.HttpMethodBase.writeRequest(HttpMethodBase.java:1973)
                    at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:993)
                    at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:397)
                    at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:170)
                    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:396)
                    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
                    at com.atlassian.bamboo.v2.build.agent.remote.sender.HttpMessageSender.send(HttpMessageSender.java:76)
                    at com.atlassian.bamboo.v2.build.agent.remote.RemoteArtifactManager.publish(RemoteArtifactManager.java:26)
                    at com.atlassian.bamboo.build.artifact.BuildArtifactPostProcessor.call(BuildArtifactPostProcessor.java:62)
                    at com.atlassian.bamboo.build.pipeline.tasks.ExecuteBuildTask.performCustomBuildProcess(ExecuteBuildTask.java:110)
                    at com.atlassian.bamboo.build.pipeline.tasks.ExecuteBuildTask.call(ExecuteBuildTask.java:70)
                    at com.atlassian.bamboo.v2.build.agent.DefaultBuildAgent.build(DefaultBuildAgent.java:188)
                    at com.atlassian.bamboo.v2.build.agent.BuildAgentControllerImpl.waitAndPerformBuild(BuildAgentControllerImpl.java:89)
                    at com.atlassian.bamboo.v2.build.agent.DefaultBuildAgent$1.run(DefaultBuildAgent.java:101)
                    at com.atlassian.bamboo.build.pipeline.concurrent.NamedThreadFactory$1.run(NamedThreadFactory.java:32)
                    at edu.emory.mathcs.backport.java.util.concurrent.helpers.ThreadHelpers$1.run(ThreadHelpers.java:34)
                    at java.lang.Thread.run(Thread.java:619)
            Caused by: javax.net.ssl.SSLException: java.net.SocketException: Broken pipe
                    at com.sun.net.ssl.internal.ssl.Alerts.getSSLException(Alerts.java:190)
                    at com.sun.net.ssl.internal.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1611)
                    at com.sun.net.ssl.internal.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1574)
                    at com.sun.net.ssl.internal.ssl.SSLSocketImpl.handleException(SSLSocketImpl.java:1538)
                    at com.sun.net.ssl.internal.ssl.SSLSocketImpl.handleException(SSLSocketImpl.java:1483)
                    at com.sun.net.ssl.internal.ssl.AppOutputStream.write(AppOutputStream.java:64)
                    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
                    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
                    at org.apache.commons.httpclient.ChunkedOutputStream.flushCacheWithAppend(ChunkedOutputStream.java:119)
                    at org.apache.commons.httpclient.ChunkedOutputStream.write(ChunkedOutputStream.java:178)
                    at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1814)
                    at java.io.ObjectOutputStream.write(ObjectOutputStream.java:681)
                    at java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:275)
                    at com.atlassian.bamboo.v2.build.agent.messages.PublishArtifacts.writeObject(PublishArtifacts.java:81)
                    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
                    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                    at java.lang.reflect.Method.invoke(Method.java:597)
                    at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:945)
                    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1461)
                    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1392)
                    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150)
                    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:326)
                    ... 19 more
            Caused by: java.net.SocketException: Broken pipe
                    at java.net.SocketOutputStream.socketWrite0(Native Method)
                    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
                    at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
                    at com.sun.net.ssl.internal.ssl.OutputRecord.writeBuffer(OutputRecord.java:295)
                    at com.sun.net.ssl.internal.ssl.OutputRecord.write(OutputRecord.java:284)
                    at com.sun.net.ssl.internal.ssl.SSLSocketImpl.writeRecordInternal(SSLSocketImpl.java:734)
                    at com.sun.net.ssl.internal.ssl.SSLSocketImpl.writeRecord(SSLSocketImpl.java:722)
                    at com.sun.net.ssl.internal.ssl.AppOutputStream.write(AppOutputStream.java:59)
                    ... 36 more
            2009-06-16 22:10:19,660 INFO [BAM::Elastic Agent on i-a9bbe5c0::Agent] [HttpMethodDirector] I/O exception (javax.net.ssl.SSLException) caught when processing request: Connection has been shutdown: j
            avax.net.ssl.SSLException: java.net.SocketException: Broken pipe
            2009-06-16 22:10:19,660 INFO [BAM::Elastic Agent on i-a9bbe5c0::Agent] [HttpMethodDirector] Retrying request
            2009-06-16 22:10:22,519 INFO [pool-1-thread-45] [LocalAcceptor] Accepted local connection on port 46593.
            2009-06-16 22:10:22,596 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Accepted tunnel connection.
            2009-06-16 22:10:22,597 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Waiting for tunnel connection.
            2009-06-16 22:10:25,904 INFO [pool-1-thread-30] [LocalAcceptor] Waiting for local connection on port 46593.
            2009-06-16 22:10:26,827 INFO [pool-1-thread-30] [LocalAcceptor] Accepted local connection on port 46593.
            2009-06-16 22:10:26,998 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Accepted tunnel connection.
            2009-06-16 22:10:26,998 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Waiting for tunnel connection.
            2009-06-16 22:10:27,308 INFO [pool-1-thread-43] [LocalAcceptor] Waiting for local connection on port 46593.
            

            So it seems that once the connection is broken, it's not re-tried or just the re-connection is unsuccessful.

            What's weird, I cannot see anything suspicious in Bamboo server's logs...

            Krystian Nowak added a comment - In another instance of such incident in elastic agent's logs I've got (apart from the messages given in previous comment) this one connected with the inability to send artefacts back to Bamboo server: 2009-06-16 22:10:19,659 ERROR [BAM::Elastic Agent on i-a9bbe5c0::Agent] [HttpMessageSender] Exception occurred when posting message com.atlassian.bamboo.v2.build.agent.messages.PublishArtifacts@c71d 5a. Connection has been shutdown: javax.net.ssl.SSLException: java.net.SocketException: Broken pipe javax.net.ssl.SSLException: Connection has been shutdown: javax.net.ssl.SSLException: java.net.SocketException: Broken pipe at com.sun.net.ssl.internal.ssl.SSLSocketImpl.checkEOF(SSLSocketImpl.java:1255) at com.sun.net.ssl.internal.ssl.SSLSocketImpl.checkWrite(SSLSocketImpl.java:1267) at com.sun.net.ssl.internal.ssl.AppOutputStream.write(AppOutputStream.java:43) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109) at org.apache.commons.httpclient.ChunkedOutputStream.flushCacheWithAppend(ChunkedOutputStream.java:119) at org.apache.commons.httpclient.ChunkedOutputStream.write(ChunkedOutputStream.java:178) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1838) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1747) at java.io.ObjectOutputStream.writeFatalException(ObjectOutputStream.java:1541) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:329) at com.atlassian.bamboo.v2.build.agent.remote.sender.HttpMessageSender$1.writeRequest(HttpMessageSender.java:44) at org.apache.commons.httpclient.methods.EntityEnclosingMethod.writeRequestBody(EntityEnclosingMethod.java:495) at org.apache.commons.httpclient.HttpMethodBase.writeRequest(HttpMethodBase.java:1973) at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:993) at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:397) at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:170) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:396) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324) at com.atlassian.bamboo.v2.build.agent.remote.sender.HttpMessageSender.send(HttpMessageSender.java:76) at com.atlassian.bamboo.v2.build.agent.remote.RemoteArtifactManager.publish(RemoteArtifactManager.java:26) at com.atlassian.bamboo.build.artifact.BuildArtifactPostProcessor.call(BuildArtifactPostProcessor.java:62) at com.atlassian.bamboo.build.pipeline.tasks.ExecuteBuildTask.performCustomBuildProcess(ExecuteBuildTask.java:110) at com.atlassian.bamboo.build.pipeline.tasks.ExecuteBuildTask.call(ExecuteBuildTask.java:70) at com.atlassian.bamboo.v2.build.agent.DefaultBuildAgent.build(DefaultBuildAgent.java:188) at com.atlassian.bamboo.v2.build.agent.BuildAgentControllerImpl.waitAndPerformBuild(BuildAgentControllerImpl.java:89) at com.atlassian.bamboo.v2.build.agent.DefaultBuildAgent$1.run(DefaultBuildAgent.java:101) at com.atlassian.bamboo.build.pipeline.concurrent.NamedThreadFactory$1.run(NamedThreadFactory.java:32) at edu.emory.mathcs.backport.java.util.concurrent.helpers.ThreadHelpers$1.run(ThreadHelpers.java:34) at java.lang.Thread.run(Thread.java:619) Caused by: javax.net.ssl.SSLException: java.net.SocketException: Broken pipe at com.sun.net.ssl.internal.ssl.Alerts.getSSLException(Alerts.java:190) at com.sun.net.ssl.internal.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1611) at com.sun.net.ssl.internal.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1574) at com.sun.net.ssl.internal.ssl.SSLSocketImpl.handleException(SSLSocketImpl.java:1538) at com.sun.net.ssl.internal.ssl.SSLSocketImpl.handleException(SSLSocketImpl.java:1483) at com.sun.net.ssl.internal.ssl.AppOutputStream.write(AppOutputStream.java:64) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109) at org.apache.commons.httpclient.ChunkedOutputStream.flushCacheWithAppend(ChunkedOutputStream.java:119) at org.apache.commons.httpclient.ChunkedOutputStream.write(ChunkedOutputStream.java:178) at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1814) at java.io.ObjectOutputStream.write(ObjectOutputStream.java:681) at java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:275) at com.atlassian.bamboo.v2.build.agent.messages.PublishArtifacts.writeObject(PublishArtifacts.java:81) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:945) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1461) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1392) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:326) ... 19 more Caused by: java.net.SocketException: Broken pipe at java.net.SocketOutputStream.socketWrite0(Native Method) at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92) at java.net.SocketOutputStream.write(SocketOutputStream.java:136) at com.sun.net.ssl.internal.ssl.OutputRecord.writeBuffer(OutputRecord.java:295) at com.sun.net.ssl.internal.ssl.OutputRecord.write(OutputRecord.java:284) at com.sun.net.ssl.internal.ssl.SSLSocketImpl.writeRecordInternal(SSLSocketImpl.java:734) at com.sun.net.ssl.internal.ssl.SSLSocketImpl.writeRecord(SSLSocketImpl.java:722) at com.sun.net.ssl.internal.ssl.AppOutputStream.write(AppOutputStream.java:59) ... 36 more 2009-06-16 22:10:19,660 INFO [BAM::Elastic Agent on i-a9bbe5c0::Agent] [HttpMethodDirector] I/O exception (javax.net.ssl.SSLException) caught when processing request: Connection has been shutdown: j avax.net.ssl.SSLException: java.net.SocketException: Broken pipe 2009-06-16 22:10:19,660 INFO [BAM::Elastic Agent on i-a9bbe5c0::Agent] [HttpMethodDirector] Retrying request 2009-06-16 22:10:22,519 INFO [pool-1-thread-45] [LocalAcceptor] Accepted local connection on port 46593. 2009-06-16 22:10:22,596 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Accepted tunnel connection. 2009-06-16 22:10:22,597 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Waiting for tunnel connection. 2009-06-16 22:10:25,904 INFO [pool-1-thread-30] [LocalAcceptor] Waiting for local connection on port 46593. 2009-06-16 22:10:26,827 INFO [pool-1-thread-30] [LocalAcceptor] Accepted local connection on port 46593. 2009-06-16 22:10:26,998 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Accepted tunnel connection. 2009-06-16 22:10:26,998 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Waiting for tunnel connection. 2009-06-16 22:10:27,308 INFO [pool-1-thread-43] [LocalAcceptor] Waiting for local connection on port 46593. So it seems that once the connection is broken, it's not re-tried or just the re-connection is unsuccessful. What's weird, I cannot see anything suspicious in Bamboo server's logs...

            I've observed it also for my setup - both Bamboo 2.2.2 and now 2.2.3. It seems to be occurring especially when build produces large amount of logs (is very verbose). This somehow raises the probability of communication failure (SSH tunnel connection seems to be broken for a while). I can observe in remote agent's logs some spikes in communication:

            2009-06-16 21:16:14,916 INFO [pool-1-thread-146] [LocalAcceptor] Accepted local connection on port 4527.
            2009-06-16 21:16:20,861 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Accepted tunnel connection.
            2009-06-16 21:16:20,861 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Waiting for tunnel connection.
            2009-06-16 21:16:21,397 INFO [pool-1-thread-115] [LocalAcceptor] Waiting for local connection on port 4527.
            2009-06-16 21:16:21,708 INFO [pool-1-thread-115] [LocalAcceptor] Accepted local connection on port 4527.
            2009-06-16 21:16:22,200 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Accepted tunnel connection.
            2009-06-16 21:16:22,201 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Waiting for tunnel connection.
            2009-06-16 21:16:22,203 INFO [pool-1-thread-158] [LocalAcceptor] Waiting for local connection on port 4527.
            2009-06-16 21:16:22,948 INFO [pool-1-thread-158] [LocalAcceptor] Accepted local connection on port 4527.
            2009-06-16 21:16:27,452 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Accepted tunnel connection.
            2009-06-16 21:16:27,452 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Waiting for tunnel connection.
            2009-06-16 21:16:27,471 INFO [pool-1-thread-126] [LocalAcceptor] Waiting for local connection on port 4527.
            2009-06-16 21:16:35,505 INFO [ActiveMQ Transport: tcp://localhost/127.0.0.1:4527] [FailoverTransport] Transport failed, attempting to automatically reconnect due to: java.net.SocketException: Socket
             closed
            java.net.SocketException: Socket closed
                    at java.net.SocketInputStream.socketRead0(Native Method)
                    at java.net.SocketInputStream.read(SocketInputStream.java:129)
                    at org.apache.activemq.transport.tcp.TcpBufferedInputStream.fill(TcpBufferedInputStream.java:49)
                    at org.apache.activemq.transport.tcp.TcpBufferedInputStream.read(TcpBufferedInputStream.java:56)
                    at java.io.DataInputStream.readInt(DataInputStream.java:370)
                    at org.apache.activemq.openwire.OpenWireFormat.unmarshal(OpenWireFormat.java:267)
                    at org.apache.activemq.transport.tcp.TcpTransport.readCommand(TcpTransport.java:156)
                    at org.apache.activemq.transport.tcp.TcpTransport.run(TcpTransport.java:136)
                    at java.lang.Thread.run(Thread.java:619)
            2009-06-16 21:16:35,508 INFO [pool-1-thread-126] [LocalAcceptor] Accepted local connection on port 4527.
            2009-06-16 21:16:35,588 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Accepted tunnel connection.
            2009-06-16 21:16:35,588 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Waiting for tunnel connection.
            2009-06-16 21:16:36,035 INFO [pool-1-thread-151] [LocalAcceptor] Waiting for local connection on port 4527.
            2009-06-16 21:16:52,186 INFO [ActiveMQ Transport: tcp://localhost/127.0.0.1:4527] [FailoverTransport] Transport failed, attempting to automatically reconnect due to: java.net.SocketException: Socket closed
            java.net.SocketException: Socket closed
                    at java.net.SocketInputStream.socketRead0(Native Method)
                    at java.net.SocketInputStream.read(SocketInputStream.java:129)
                    at org.apache.activemq.transport.tcp.TcpBufferedInputStream.fill(TcpBufferedInputStream.java:49)
                    at org.apache.activemq.transport.tcp.TcpBufferedInputStream.read(TcpBufferedInputStream.java:56)
                    at java.io.DataInputStream.readInt(DataInputStream.java:370)
                    at org.apache.activemq.openwire.OpenWireFormat.unmarshal(OpenWireFormat.java:267)
                    at org.apache.activemq.transport.tcp.TcpTransport.readCommand(TcpTransport.java:156)
                    at org.apache.activemq.transport.tcp.TcpTransport.run(TcpTransport.java:136)
                    at java.lang.Thread.run(Thread.java:619)
            2009-06-16 21:16:52,188 INFO [pool-1-thread-151] [LocalAcceptor] Accepted local connection on port 4527.
            2009-06-16 21:16:52,209 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Accepted tunnel connection.
            2009-06-16 21:16:52,210 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Waiting for tunnel connection.
            2009-06-16 21:16:52,211 INFO [pool-1-thread-126] [LocalAcceptor] Waiting for local connection on port 4527.
            2009-06-16 21:16:53,540 INFO [pool-1-thread-126] [LocalAcceptor] Accepted local connection on port 4527.
            2009-06-16 21:17:16,018 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Accepted tunnel connection.
            2009-06-16 21:17:16,018 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Waiting for tunnel connection.
            2009-06-16 21:17:16,451 INFO [pool-1-thread-126] [LocalAcceptor] Waiting for local connection on port 4527.
            

            Not always, but very frequently, they cause Bamboo server to "think" a build running on the elastic agent to be still in progress - e.g. displaying:

            Bamboo has detected that OXP28-CANOOPOSTGRESQL262TO27X-4 may have hung.
            This build has been running for 482 minutes, which is 487% longer than usual.
            It has been 473 minutes since Bamboo received a log message for this build. 
            

            and some commands/logs from "the middle" of the build process, whereas in agent's logs I can spot that the build has already been finished:

            2009-06-16 22:19:38,895 INFO [BAM::Elastic Agent on i-2757094e::Agent] [ExecuteBuildTask] Running post build plugin 'Post Command Agent Runner'
            2009-06-16 22:19:38,959 INFO [BAM::Elastic Agent on i-2757094e::Agent] [PostBuildCommandAgentAction] post command set to run on agent, running
            2009-06-16 22:19:38,966 INFO [BAM::Elastic Agent on i-2757094e::Agent] [AbstractSubstitutionBean] Substituting variable: ${bamboo.projectBaseDir} with /mnt/bamboo-ebs/bamboo-agent/build-dir/OXP28-CANOOPOSTGRESQL262TO27X
            2009-06-16 22:19:38,966 INFO [BAM::Elastic Agent on i-2757094e::Agent] [AbstractSubstitutionBean] Substituting variable: ${bamboo.buildKey} with OXP28-CANOOPOSTGRESQL262TO27X
            2009-06-16 22:19:38,966 INFO [BAM::Elastic Agent on i-2757094e::Agent] [PostBuildCommandRunner] running command: /mnt/bamboo-ebs/bamboo-agent/build-dir/OXP28-CANOOPOSTGRESQL262TO27X/../../post-success.sh OXP28-CANOOPOSTGRESQL262TO27X
            2009-06-16 22:19:38,993 INFO [Thread-45] [PostBuildCommandRunner] post success for build OXP28-CANOOPOSTGRESQL262TO27X
            2009-06-16 22:19:38,993 INFO [Thread-45] [PostBuildCommandRunner] DATABASE = postgresql-8.3
            2009-06-16 22:19:39,995 INFO [Thread-45] [PostBuildCommandRunner] pg_ctl: server is running (PID: 4964)
            2009-06-16 22:19:39,996 INFO [Thread-45] [PostBuildCommandRunner] Stopping PostgreSQL: server stopped
            2009-06-16 22:19:39,996 INFO [Thread-45] [PostBuildCommandRunner] ok
            2009-06-16 22:19:39,996 INFO [Thread-45] [PostBuildCommandRunner] pg_ctl: no server running
            2009-06-16 22:19:44,999 INFO [BAM::Elastic Agent on i-2757094e::Agent] [PostBuildCommandRunner] exit status: 0
            2009-06-16 22:19:44,999 INFO [BAM::Elastic Agent on i-2757094e::Agent] [ExecuteBuildTask] Running post build plugin 'NCover Results Collector'
            2009-06-16 22:19:45,005 INFO [pool-1-thread-176] [LocalAcceptor] Accepted local connection on port 46593.
            2009-06-16 22:19:45,009 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Accepted tunnel connection.
            2009-06-16 22:19:45,010 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Waiting for tunnel connection.
            2009-06-16 22:19:45,098 INFO [pool-1-thread-175] [LocalAcceptor] Waiting for local connection on port 46593.
            2009-06-16 22:19:45,098 INFO [pool-1-thread-175] [LocalAcceptor] Accepted local connection on port 46593.
            2009-06-16 22:19:45,101 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Accepted tunnel connection.
            2009-06-16 22:19:45,101 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Waiting for tunnel connection.
            2009-06-16 22:19:45,104 INFO [pool-1-thread-175] [LocalAcceptor] Waiting for local connection on port 46593.
            2009-06-16 22:19:45,118 INFO [BAM::Elastic Agent on i-2757094e::Agent] [NCoverBuildProcessor] inside NCoverBuildProcessor.call()
            2009-06-16 22:19:45,118 INFO [BAM::Elastic Agent on i-2757094e::Agent] [ExecuteBuildTask] Running post build plugin 'Clover Results Collector'
            2009-06-16 22:19:45,181 INFO [BAM::Elastic Agent on i-2757094e::Agent] [ExecuteBuildTask] Running post build plugin 'PMD Build Processor'
            2009-06-16 22:19:45,199 INFO [BAM::Elastic Agent on i-2757094e::Agent] [PMDBuildProcessor] Running PMD parser
            

            So basically isn't able to get the info that the build has already been finished, and it requires nothing better than just shutting the failed EC2 instance MANUALLY from Administration -> Elastic Bamboo -> Instances menu.

            Krystian Nowak added a comment - I've observed it also for my setup - both Bamboo 2.2.2 and now 2.2.3. It seems to be occurring especially when build produces large amount of logs (is very verbose). This somehow raises the probability of communication failure (SSH tunnel connection seems to be broken for a while). I can observe in remote agent's logs some spikes in communication: 2009-06-16 21:16:14,916 INFO [pool-1-thread-146] [LocalAcceptor] Accepted local connection on port 4527. 2009-06-16 21:16:20,861 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Accepted tunnel connection. 2009-06-16 21:16:20,861 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Waiting for tunnel connection. 2009-06-16 21:16:21,397 INFO [pool-1-thread-115] [LocalAcceptor] Waiting for local connection on port 4527. 2009-06-16 21:16:21,708 INFO [pool-1-thread-115] [LocalAcceptor] Accepted local connection on port 4527. 2009-06-16 21:16:22,200 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Accepted tunnel connection. 2009-06-16 21:16:22,201 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Waiting for tunnel connection. 2009-06-16 21:16:22,203 INFO [pool-1-thread-158] [LocalAcceptor] Waiting for local connection on port 4527. 2009-06-16 21:16:22,948 INFO [pool-1-thread-158] [LocalAcceptor] Accepted local connection on port 4527. 2009-06-16 21:16:27,452 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Accepted tunnel connection. 2009-06-16 21:16:27,452 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Waiting for tunnel connection. 2009-06-16 21:16:27,471 INFO [pool-1-thread-126] [LocalAcceptor] Waiting for local connection on port 4527. 2009-06-16 21:16:35,505 INFO [ActiveMQ Transport: tcp://localhost/127.0.0.1:4527] [FailoverTransport] Transport failed, attempting to automatically reconnect due to: java.net.SocketException: Socket closed java.net.SocketException: Socket closed at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at org.apache.activemq.transport.tcp.TcpBufferedInputStream.fill(TcpBufferedInputStream.java:49) at org.apache.activemq.transport.tcp.TcpBufferedInputStream.read(TcpBufferedInputStream.java:56) at java.io.DataInputStream.readInt(DataInputStream.java:370) at org.apache.activemq.openwire.OpenWireFormat.unmarshal(OpenWireFormat.java:267) at org.apache.activemq.transport.tcp.TcpTransport.readCommand(TcpTransport.java:156) at org.apache.activemq.transport.tcp.TcpTransport.run(TcpTransport.java:136) at java.lang.Thread.run(Thread.java:619) 2009-06-16 21:16:35,508 INFO [pool-1-thread-126] [LocalAcceptor] Accepted local connection on port 4527. 2009-06-16 21:16:35,588 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Accepted tunnel connection. 2009-06-16 21:16:35,588 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Waiting for tunnel connection. 2009-06-16 21:16:36,035 INFO [pool-1-thread-151] [LocalAcceptor] Waiting for local connection on port 4527. 2009-06-16 21:16:52,186 INFO [ActiveMQ Transport: tcp://localhost/127.0.0.1:4527] [FailoverTransport] Transport failed, attempting to automatically reconnect due to: java.net.SocketException: Socket closed java.net.SocketException: Socket closed at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at org.apache.activemq.transport.tcp.TcpBufferedInputStream.fill(TcpBufferedInputStream.java:49) at org.apache.activemq.transport.tcp.TcpBufferedInputStream.read(TcpBufferedInputStream.java:56) at java.io.DataInputStream.readInt(DataInputStream.java:370) at org.apache.activemq.openwire.OpenWireFormat.unmarshal(OpenWireFormat.java:267) at org.apache.activemq.transport.tcp.TcpTransport.readCommand(TcpTransport.java:156) at org.apache.activemq.transport.tcp.TcpTransport.run(TcpTransport.java:136) at java.lang.Thread.run(Thread.java:619) 2009-06-16 21:16:52,188 INFO [pool-1-thread-151] [LocalAcceptor] Accepted local connection on port 4527. 2009-06-16 21:16:52,209 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Accepted tunnel connection. 2009-06-16 21:16:52,210 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Waiting for tunnel connection. 2009-06-16 21:16:52,211 INFO [pool-1-thread-126] [LocalAcceptor] Waiting for local connection on port 4527. 2009-06-16 21:16:53,540 INFO [pool-1-thread-126] [LocalAcceptor] Accepted local connection on port 4527. 2009-06-16 21:17:16,018 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Accepted tunnel connection. 2009-06-16 21:17:16,018 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Waiting for tunnel connection. 2009-06-16 21:17:16,451 INFO [pool-1-thread-126] [LocalAcceptor] Waiting for local connection on port 4527. Not always, but very frequently, they cause Bamboo server to "think" a build running on the elastic agent to be still in progress - e.g. displaying: Bamboo has detected that OXP28-CANOOPOSTGRESQL262TO27X-4 may have hung. This build has been running for 482 minutes, which is 487% longer than usual. It has been 473 minutes since Bamboo received a log message for this build. and some commands/logs from "the middle" of the build process, whereas in agent's logs I can spot that the build has already been finished: 2009-06-16 22:19:38,895 INFO [BAM::Elastic Agent on i-2757094e::Agent] [ExecuteBuildTask] Running post build plugin 'Post Command Agent Runner' 2009-06-16 22:19:38,959 INFO [BAM::Elastic Agent on i-2757094e::Agent] [PostBuildCommandAgentAction] post command set to run on agent, running 2009-06-16 22:19:38,966 INFO [BAM::Elastic Agent on i-2757094e::Agent] [AbstractSubstitutionBean] Substituting variable: ${bamboo.projectBaseDir} with /mnt/bamboo-ebs/bamboo-agent/build-dir/OXP28-CANOOPOSTGRESQL262TO27X 2009-06-16 22:19:38,966 INFO [BAM::Elastic Agent on i-2757094e::Agent] [AbstractSubstitutionBean] Substituting variable: ${bamboo.buildKey} with OXP28-CANOOPOSTGRESQL262TO27X 2009-06-16 22:19:38,966 INFO [BAM::Elastic Agent on i-2757094e::Agent] [PostBuildCommandRunner] running command: /mnt/bamboo-ebs/bamboo-agent/build-dir/OXP28-CANOOPOSTGRESQL262TO27X/../../post-success.sh OXP28-CANOOPOSTGRESQL262TO27X 2009-06-16 22:19:38,993 INFO [Thread-45] [PostBuildCommandRunner] post success for build OXP28-CANOOPOSTGRESQL262TO27X 2009-06-16 22:19:38,993 INFO [Thread-45] [PostBuildCommandRunner] DATABASE = postgresql-8.3 2009-06-16 22:19:39,995 INFO [Thread-45] [PostBuildCommandRunner] pg_ctl: server is running (PID: 4964) 2009-06-16 22:19:39,996 INFO [Thread-45] [PostBuildCommandRunner] Stopping PostgreSQL: server stopped 2009-06-16 22:19:39,996 INFO [Thread-45] [PostBuildCommandRunner] ok 2009-06-16 22:19:39,996 INFO [Thread-45] [PostBuildCommandRunner] pg_ctl: no server running 2009-06-16 22:19:44,999 INFO [BAM::Elastic Agent on i-2757094e::Agent] [PostBuildCommandRunner] exit status: 0 2009-06-16 22:19:44,999 INFO [BAM::Elastic Agent on i-2757094e::Agent] [ExecuteBuildTask] Running post build plugin 'NCover Results Collector' 2009-06-16 22:19:45,005 INFO [pool-1-thread-176] [LocalAcceptor] Accepted local connection on port 46593. 2009-06-16 22:19:45,009 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Accepted tunnel connection. 2009-06-16 22:19:45,010 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Waiting for tunnel connection. 2009-06-16 22:19:45,098 INFO [pool-1-thread-175] [LocalAcceptor] Waiting for local connection on port 46593. 2009-06-16 22:19:45,098 INFO [pool-1-thread-175] [LocalAcceptor] Accepted local connection on port 46593. 2009-06-16 22:19:45,101 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Accepted tunnel connection. 2009-06-16 22:19:45,101 INFO [com.sun.sungrid.service.tunnel.server.TunnelServer] [TunnelServer] Waiting for tunnel connection. 2009-06-16 22:19:45,104 INFO [pool-1-thread-175] [LocalAcceptor] Waiting for local connection on port 46593. 2009-06-16 22:19:45,118 INFO [BAM::Elastic Agent on i-2757094e::Agent] [NCoverBuildProcessor] inside NCoverBuildProcessor.call() 2009-06-16 22:19:45,118 INFO [BAM::Elastic Agent on i-2757094e::Agent] [ExecuteBuildTask] Running post build plugin 'Clover Results Collector' 2009-06-16 22:19:45,181 INFO [BAM::Elastic Agent on i-2757094e::Agent] [ExecuteBuildTask] Running post build plugin 'PMD Build Processor' 2009-06-16 22:19:45,199 INFO [BAM::Elastic Agent on i-2757094e::Agent] [PMDBuildProcessor] Running PMD parser So basically isn't able to get the info that the build has already been finished, and it requires nothing better than just shutting the failed EC2 instance MANUALLY from Administration -> Elastic Bamboo -> Instances menu.

            MarkC added a comment -

            Does the updateHeartBeat actually hang or is it just rerunning the fail over transport?

            If you turn the debug logging for ActiveMQ you see in much better detail what's happenning

            log4j.category.org.apache.activemq.transport.failover.FailoverTransport=DEBUG
            log4j.category.org.apache.activemq=DEBUG
            

            There already is some manual retry mechanism in the updateHeartBeat method if an exception is thrown. We're are using the Failover transport so we might be able to tweak the configuration there to return exception more consistently.

            MarkC added a comment - Does the updateHeartBeat actually hang or is it just rerunning the fail over transport? If you turn the debug logging for ActiveMQ you see in much better detail what's happenning log4j.category.org.apache.activemq.transport.failover.FailoverTransport=DEBUG log4j.category.org.apache.activemq=DEBUG There already is some manual retry mechanism in the updateHeartBeat method if an exception is thrown. We're are using the Failover transport so we might be able to tweak the configuration there to return exception more consistently.

              pbruski Przemek Bruski
              mwent Marek Went (Inactive)
              Votes:
              2 Vote for this issue
              Watchers:
              1 Start watching this issue

                Created:
                Updated:
                Resolved: