RemoteElasticInstanceImpl.getID() is called regularly in Bamboo. If any of the supervising threads are working, a call to get the instance ID will block on the instance's monitor.

      The fix here is to make the ID field volatile rather than synchronized, which prevents the UI from blocking on the instances page while the supervisor threads are taking their time. Bamboo treats instances with a null ID as not yet set up and already handles that case correctly.
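      A minimal sketch of that fix (class and method names are simplified stand-ins, not the actual Bamboo API): the ID is written once by the supervisor thread and read by many UI threads, so a volatile field gives readers visibility without ever taking the monitor.

```java
// Hypothetical, simplified sketch of the proposed fix.
class RemoteInstance {
    // volatile: safe publication of the ID without locking, so getID()
    // never blocks behind a supervisor thread stuck in a remote call
    private volatile String instanceId; // null until the instance is set up

    // called by the supervisor thread once EC2 reports the ID
    void assignId(String id) {
        this.instanceId = id;
    }

    // lock-free read; may return null, which callers treat as "not set up yet"
    String getID() {
        return instanceId;
    }
}
```

      Readers that see null simply get the "not set up" state, which per the description above Bamboo already handles.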

      Attached is the thread dump taken from Panda this morning.

            [BAM-5295] Deadlock in RemoteElasticInstanceImpl

            MarkC added a comment -

            Think the short and long of it is that the object over-synchronizes everything and does far too much slow and unreliable work inside its synchronized blocks (like remote calls).

            I haven't looked at the attached dump, but in the one from this morning it's basically hanging on describing an instance that's also being terminated. So I suspect that when describing an instance that's shut down, EC2 probably hangs (it can do that).

            The fact that getID is synchronized means that all the UI calls, and anything that logs the ID, also hang.

            Thread 1

            "pool-2-thread-1" prio=1 tid=0x0c11a200 nid=0x2d1f runnable [0x57107000..0x57108040]
            	at java.net.SocketInputStream.socketRead0(Native Method)
            	at java.net.SocketInputStream.read(SocketInputStream.java:129)
            	at com.sun.net.ssl.internal.ssl.InputRecord.readFully(InputRecord.java:293)
            	at com.sun.net.ssl.internal.ssl.InputRecord.readV3Record(InputRecord.java:409)
            	at com.sun.net.ssl.internal.ssl.InputRecord.read(InputRecord.java:360)
            	at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:723)
            	- locked <0x73fad100> (a java.lang.Object)
            	at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:680)
            	at com.sun.net.ssl.internal.ssl.AppInputStream.read(AppInputStream.java:75)
            	- locked <0x73fad160> (a com.sun.net.ssl.internal.ssl.AppInputStream)
            	at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
            	at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
            	- locked <0x73fad630> (a java.io.BufferedInputStream)
            	at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:77)
            	at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:105)
            	at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1115)
            	at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1373)
            	at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1832)
            	at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1590)
            	at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:995)
            	at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:397)
            	at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:170)
            	at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:396)
            	at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
            	at com.xerox.amazonws.common.AWSQueryConnection.makeRequest(AWSQueryConnection.java:330)
            	at com.xerox.amazonws.ec2.Jec2.makeRequestInt(Jec2.java:1678)
            	at com.xerox.amazonws.ec2.Jec2.describeInstances(Jec2.java:678)
            	at com.atlassian.aws.ec2.RemoteEC2InstanceImpl.describeInstance(RemoteEC2InstanceImpl.java:288)
            	at com.atlassian.aws.ec2.EC2InstanceState$2.supervise(EC2InstanceState.java:77)
            	at com.atlassian.aws.ec2.RemoteEC2InstanceImpl.backgroundSupervise(RemoteEC2InstanceImpl.java:437)
            	- locked <0x73646838> (a com.atlassian.aws.ec2.RemoteEC2InstanceImpl)
            	at com.atlassian.aws.ec2.RemoteEC2InstanceImpl.access$300(RemoteEC2InstanceImpl.java:25)
            	at com.atlassian.aws.ec2.RemoteEC2InstanceImpl$2.run(RemoteEC2InstanceImpl.java:127)
            	at com.atlassian.aws.ec2.RemoteEC2InstanceImpl$CatchingRunnableDecorator.run(RemoteEC2InstanceImpl.java:98)
            	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:417)
            ...
            	at java.lang.Thread.run(Thread.java:595)
            

            Lots of HTTP threads are waiting on that, plus:

            "pool-2-thread-3" prio=1 tid=0x0b934180 nid=0x2d29 waiting for monitor entry [0x57086000..0x57086fc0]
            	at com.atlassian.aws.ec2.RemoteEC2InstanceImpl.getID(RemoteEC2InstanceImpl.java:305)
            	- waiting to lock <0x73646838> (a com.atlassian.aws.ec2.RemoteEC2InstanceImpl)
            	at com.atlassian.aws.ec2.RemoteEC2InstanceImpl.getAttachedVolumes(RemoteEC2InstanceImpl.java:265)
            	at com.atlassian.bamboo.agent.elastic.server.EBSVolumeSupervisorImpl$1.run(EBSVolumeSupervisorImpl.java:55)
            	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:417)
            	at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:280)
            	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:135)
            	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:65)
            	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:142)
            	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:166)
            	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
            	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
            	at java.lang.Thread.run(Thread.java:595)
            
            
            AND 
            
            
            "QuartzScheduler_Worker-4" prio=1 tid=0x0a77cd98 nid=0x2ccd waiting for monitor entry [0x58bfc000..0x58bfd140]
            	at com.atlassian.aws.ec2.RemoteEC2InstanceImpl.terminate(RemoteEC2InstanceImpl.java)
            	- waiting to lock <0x73646838> (a com.atlassian.aws.ec2.RemoteEC2InstanceImpl)
            	at com.atlassian.bamboo.agent.elastic.server.RemoteElasticInstanceImpl.terminate(RemoteElasticInstanceImpl.java:375)
            	- locked <0x7363e610> (a com.atlassian.bamboo.agent.elastic.server.RemoteElasticInstanceImpl)
            	at com.atlassian.bamboo.agent.elastic.server.ElasticFunctionalityFacadeImpl.shutdownInstance(ElasticFunctionalityFacadeImpl.java:94)
            	at com.atlassian.bamboo.agent.elastic.server.ElasticFunctionalityFacadeImpl.attemptShutdownInstances(ElasticFunctionalityFacadeImpl.java:470)
            	at com.atlassian.bamboo.agent.elastic.server.ElasticFunctionalityFacadeImpl.adjustElasticInstancesForConfig(ElasticFunctionalityFacadeImpl.java:445)
            	at com.atlassian.bamboo.agent.elastic.server.ElasticFunctionalityFacadeImpl.adjustElasticInstanceNumbers(ElasticFunctionalityFacadeImpl.java:428)
            	at com.atlassian.bamboo.agent.elastic.schedule.ElasticInstanceScheduleJob.execute(ElasticInstanceScheduleJob.java:29)
            	at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
            	at com.atlassian.bamboo.quartz.SystemAuthorizedThreadPool$1.run(SystemAuthorizedThreadPool.java:36)
            	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:525)
            

            There's no actual Java-level deadlock, but it's effectively a deadlock on the external EC2 resource.
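            A hedged sketch of the broader fix implied by the comment above: perform the slow, unreliable remote call (the describeInstances-style request) outside the monitor, and take the lock only briefly to publish the result. The names below are illustrative, not the actual Bamboo API.

```java
// Sketch: keep network I/O out of synchronized blocks so getters stay responsive.
class Supervisor {
    private final Object lock = new Object();
    private String lastState;

    // stands in for the slow EC2 describe call that can hang
    interface RemoteCall { String describe(); }

    void supervise(RemoteCall ec2) {
        // slow remote I/O happens with no lock held
        String state = ec2.describe();
        synchronized (lock) {
            // only the cheap in-memory update runs under the lock
            lastState = state;
        }
    }

    String getState() {
        synchronized (lock) { return lastState; }
    }
}
```

            With this shape, a hung describe call only stalls its own supervisor thread; readers and terminate() can still acquire the lock.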

