Bamboo Data Center / BAM-2664

Transferring artifacts of large size requires too much CPU resources

    • Type: Bug
    • Resolution: Fixed
    • Priority: Medium
    • Fix Version/s: 2.2
    • Affects Version/s: 2.0.2
    • Component/s: Agents, Artifacts
    • Labels: None
    • Environment:
      App Server: jetty-6.1.5
      Distro: Ubuntu 7.04
      Java: Sun JDK 1.5.0_11

      When a build on a remote build agent finishes, the transfer of the build result artifacts is initiated.
      In our situation, these files can be 500-700 megabytes.

      The Java process takes over the CPU completely (sometimes 99% CPU usage) while these artifacts are being transferred. I guess there is a lot of overhead in encapsulating the data for transfer through ActiveMQ?

      As a result of this load, the remote Bamboo agents go offline (all of them, in batches, so this isn't caused by a network problem), probably because of missed heartbeats.

      Can the performance of transferring files to the Bamboo server be increased, or made less CPU intensive?


            MarkC added a comment -

            This should be resolved for 2.2. We've changed the way we post back artifacts and the performance and CPU usage should be greatly reduced.


            David O'Flynn [Atlassian] added a comment -

            Hi Richard,

            Unfortunately not - you need to upgrade to allow configuration of the timeout intervals.

            Cheers,
            Dave.

            Richard Neale added a comment -

            We are experiencing the same issue. We are on 2.0.2 and will upgrade at the earliest opportunity. Is it possible to set the heartbeat timeout and interval without the upgrade? If so, where do you suggest this be done?

            MarkC added a comment -

            Folks,

            While the original title of this issue hasn't been resolved, we've taken some steps to alleviate the remote agent issues. Namely:

            • Increased the default heartbeat timeout from 15 to 60 seconds (configurable through the bamboo.agent.heartbeatTimeoutSeconds property)
            • Increased the default heartbeat check interval from 5s to 20s (configurable through the bamboo.agent.heartbeatCheckInterval property)
            • Increased the thread priority of heartbeat sending
            • When an agent is marked as offline and comes back, it'll be forced to restart (since we don't know what state it's in)
            • Improved the logging around this area. If you're still suffering problems post 2.0.6, please add further logging through:
              log4j.category.com.atlassian.bamboo.buildqueue.manager.RemoteAgentManagerImpl=DEBUG
              log4j.category.com.atlassian.bamboo.v2.build.agent.remote.heartbeat.AgentHeartBeatJob=DEBUG

            Please upgrade to 2.0.6 and see if the problem of agents dropping out still exists. If you still run into problems please add the extra logging and attach the new logs in a separate issue / support request, and we'll handle them there.

            Koen, I'll leave this issue open to track improving the performance of artifact transfer.

            Cheers,

            Mark C
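
            For readers who want to see how the two properties named in the comment above might be consumed, here is a minimal sketch. It assumes they are supplied as JVM system properties (for example as -D flags) and that both values are in seconds. The class HeartbeatSettings and its methods are hypothetical and are not Bamboo's actual code; only the property names and the defaults (60 and 20) come from the comment.

```java
public class HeartbeatSettings {
    // Property names are taken from the comment above.
    static final String TIMEOUT_PROPERTY = "bamboo.agent.heartbeatTimeoutSeconds";
    static final String CHECK_INTERVAL_PROPERTY = "bamboo.agent.heartbeatCheckInterval";

    // Defaults mirror the new values described in the comment (assumed to be seconds).
    static final int DEFAULT_TIMEOUT_SECONDS = 60;
    static final int DEFAULT_CHECK_INTERVAL_SECONDS = 20;

    /** Returns the heartbeat timeout, falling back to the default when unset or not a number. */
    static int timeoutSeconds() {
        return Integer.getInteger(TIMEOUT_PROPERTY, DEFAULT_TIMEOUT_SECONDS);
    }

    /** Returns the heartbeat check interval, falling back to the default when unset or not a number. */
    static int checkIntervalSeconds() {
        return Integer.getInteger(CHECK_INTERVAL_PROPERTY, DEFAULT_CHECK_INTERVAL_SECONDS);
    }

    public static void main(String[] args) {
        System.out.println("heartbeat timeout (s): " + timeoutSeconds());
        System.out.println("heartbeat check interval (s): " + checkIntervalSeconds());
    }
}
```

            Under that assumption, starting the JVM with -Dbamboo.agent.heartbeatTimeoutSeconds=120 would make timeoutSeconds() return 120, while leaving the property unset keeps the default.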


            Adrian Hempel [Atlassian] added a comment -

            I have been unable to reproduce any problem with the heartbeat mechanism on my machine during transfer of very large artifacts, even under extreme CPU load.

            Tomorrow, I'll seek additional logging from the affected customers.

            Koen Vereeken added a comment -

            Indeed, the clients do not have this high CPU load, only the server.
            I also tried setting the heartbeat interval to 20 seconds instead of the default setting.
            But I still keep having agents go offline during artifact transfer (only less quickly).

            Adrian Hempel [Atlassian] added a comment -

            While CPU usage is high on the server during artifact transfer, profiling the server shows no single obvious performance bottleneck: most of the CPU usage is going to network reads (15%) and file writes (14%) initiated by ActiveMQ, which is reasonable.

            Message segments are of a reasonable size (most are about 7K).

            I'll have a look at what's preventing the heartbeats from getting through. It may be thread starvation, or it might be flooding of the communications channel.

            Adrian Hempel [Atlassian] added a comment -

            It seems that the high CPU usage occurs in the server process, rather than the agent process.

            Does this agree with what others are seeing?

            Koen Vereeken added a comment -

            I guess this is a major rework, so it won't be included in any bug-fix release.
            Is there a workaround for this error?

            I skipped the transfer of artifacts and didn't have any remote build agents disconnect today.

            Adrian Hempel [Atlassian] added a comment -

            Two things we should look at:

            1. Performing the heartbeat on a high priority Thread.
            2. Improving the buffering of artifact transfer.
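
            The two ideas in the comment above can be illustrated with a short, generic sketch. It is not Bamboo's implementation: the class name, the method names, the thread name and the 64 KB buffer size are all hypothetical, and it uses lambdas, so it needs a newer Java than the JDK 1.5 listed in the environment. It shows a scheduler whose single worker thread runs at maximum priority, and a stream copy that moves data through a fixed-size buffer instead of byte by byte.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;

public class HeartbeatAndTransferSketch {

    /** Creates a scheduler whose single worker thread runs at maximum priority. */
    static ScheduledExecutorService newHeartbeatScheduler() {
        ThreadFactory factory = runnable -> {
            Thread thread = new Thread(runnable, "agent-heartbeat"); // hypothetical name
            thread.setPriority(Thread.MAX_PRIORITY); // a hint to the OS scheduler, not a guarantee
            thread.setDaemon(true);                  // do not keep the JVM alive for heartbeats
            return thread;
        };
        return Executors.newSingleThreadScheduledExecutor(factory);
    }

    /** Copies everything from in to out through a buffer of the given size; returns bytes copied. */
    static long copyBuffered(InputStream in, OutputStream out, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        out.flush();
        return total;
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService scheduler = newHeartbeatScheduler();
        // Stand-in for sending a heartbeat message once per second.
        scheduler.scheduleAtFixedRate(() -> System.out.println("heartbeat"), 0, 1, TimeUnit.SECONDS);

        // Stand-in for an artifact transfer: copy 1 MB in 64 KB chunks.
        byte[] artifact = new byte[1 << 20];
        long copied = copyBuffered(new ByteArrayInputStream(artifact),
                                   new ByteArrayOutputStream(), 64 * 1024);
        System.out.println("copied " + copied + " bytes");

        scheduler.shutdownNow();
    }
}
```

            Thread priority is only advisory on most platforms, so a high-priority heartbeat thread reduces the chance of starvation but would not help if the heartbeats are being held up by a flooded communications channel, the other cause suspected earlier in this thread.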

              Assignee: Adrian Hempel [Atlassian]
              Reporter: Koen Vereeken
              Affected customers: 3
              Watchers: 4