Bitbucket Cloud / BCLOUD-13863

Out of memory errors reported as failed pipelines, even when ignored

      Hi guys and @rahulchhabria,

      We've got a smallish monolithic repo (~1.5 Mloc) that we have been building with Pipelines.

      So far Pipelines itself is great and appears to have regular improvements, which we love - we've been tossing up moving away from our internal buildbot and relying on this entirely.

      There are a couple of issues though that we are having trouble with, and have been wondering if there are nice solutions to:

      • There appears to be no nice way to maintain build state, short of seeding the container with object files (e.g. checking in build artifacts). This means that even a tiny change results in a 40-60 minute build.

      For this, we understand that letting people have incremental builds leads to a whole host of problems, but it'd be great to have an advanced feature that lets you maintain image state by default (acknowledging that you may get failed builds or weirdness) and only do a full build if specified, e.g. if there is an error

      • Related, if people check in multiple commits we get multiple 40-60 minute builds (is there no build holdoff time as with buildbot? If it doesn't exist already, it might be nice to have the option to only build head and not all intermediate commits)
      • We can't build with any more than -j3, as we get a non-recoverable error halfway through the build that memory is exhausted. Peak build memory usage on our CentOS 6 system with -j6 is 3.6 GB (vs 2.6 GB with -j3 and 4.1 GB with -j8), so I'm not sure why this doesn't work.

      The fact that the OOM error is fatal, non-recoverable, and takes 20 minutes to occur is itself counter-productive too - a continuable soft limit using cgroups or similar (perhaps including email warnings) would be infinitely preferable; a rough sketch of the cgroup idea follows this list.

      • Having basic interactive console access to the system would be very helpful. Having to edit the pipelines yml, commit, wait for the error email, and repeat is a slow workflow.
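
      To make the cgroup idea concrete, here is a rough sketch of what we have in mind - hypothetical only, since it assumes root access and a writable /sys/fs/cgroup/memory inside the build container (which may not be granted), and the "build" cgroup name is made up:

      #!bash
      # Hypothetical sketch of a "continuable soft limit" via the cgroup v1
      # memory controller. Assumes root and a writable /sys/fs/cgroup/memory
      # inside the container; the "build" cgroup name is made up.
      mkdir -p /sys/fs/cgroup/memory/build
      echo $((3 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/memory/build/memory.soft_limit_in_bytes
      echo $$ > /sys/fs/cgroup/memory/build/cgroup.procs   # move this shell (and its children) into the cgroup
      make -j6   # over the soft limit, pages get reclaimed first instead of the build being killed outright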

      Apologies if any of these are covered in documentation - I haven't been through all of it in the last month or so, so will likely have missed things

            Matt Ryall added a comment -

            We have addressed large 8GB builds with a recent change, described here: https://blog.bitbucket.org/2018/02/20/support-large-builds-bitbucket-pipelines/

            I believe this edge case bug has also been fixed. Please let us know if it is still causing issues.

            Matt Ryall added a comment -

            Issue BCLOUD-13834 was marked as a duplicate of this issue.

            nick_viv added a comment -

            Thanks Matt, see BCLOUD-13874

            As this is already an issue for us, we'd like to see if this is possible sooner rather than later - our builds are slow on your current infrastructure, and it seems like a simple change on your side to make them significantly faster.

            Matt Ryall added a comment -

            Hi guys,

            It seems you've found a bug with how our new infrastructure presents memory errors, so I'll update this ticket to reflect that. To explain:

            • We have always configured the memory limit in Docker using -m as suggested above, which sets the cgroup memory limit and results in the kernel automatically killing any process that exceeds this limit ("OOM killed").
            • Any command in Pipelines that is OOM killed would return a failure exit code, which would normally terminate the Pipeline at this point. But in your case, you're using || true (or similar tricks) to continue running the script when this occurs.
            • On our previous infrastructure we didn't detect OOM-killed processes as a distinct failure mode, but since we migrated to new infrastructure in the past two weeks, now we do. We check for the "memory exceeded" flag on the container when it completes, and if it is set – regardless of the Pipeline exit code – we mark the build as failed due to memory exhaustion (a rough local reproduction follows this list).
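
            To make the behaviour above concrete, a rough local reproduction might look like the following - a sketch only, assuming Docker is available locally; the image name and build commands are placeholders rather than anyone's actual setup:

            #!bash
            # Hypothetical local reproduction: run a memory-limited container whose
            # script swallows the OOM failure with "|| true", then inspect the
            # container's OOM flag afterwards (image name and commands are placeholders).
            docker run --name oom-test -m 4g my-build-image \
                sh -c 'make -j6 || true; make'
            docker inspect --format '{{.State.OOMKilled}}' oom-test   # "true" if a process in the container was OOM killed (exact semantics vary by Docker version)

            In a case like this the container's exit code can be 0 (the serial make succeeds) while the OOMKilled flag is still set, which is exactly the mismatch described above.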

            Our mistaken expectation was that any pipeline that reported an "OOM killed" status would also be a failed build. However, in your case, that isn't true: your pipeline is ignoring the process failure and continuing to run. So the fix would be for us to switch back to using the pipeline script exit code as the build status, and instead surface the OOM status as a warning when the build is successful.

            Does that sound correct to you both?

            Before I try more debugging, can you explain a bit more about why you can't just allocate 8 GB - either for us, or for all containers?

            Sure, the reason is that we've planned our Pipelines infrastructure and pricing based on allocating 4 GB to each customer. So larger memory allocations are something we could offer only with changes to our planned pricing. Based on the customer feedback we've heard to date, 4 GB seems to be enough for the vast majority of Pipelines users, although we know there are some exceptions.

            If you'd like to open a ticket to specifically request increased memory, we can track demand for this and consider it in the future. I'd prefer to keep this ticket tracking the bug with OOM build status reporting, given the relevant technical details included above.

            Thanks,
            Matt

            nick_viv added a comment -

            Hey guys,

            Thanks for the responses. In order:

            Joshua:

            • Dependency caching may help, although these dependencies are internal to our project - so we'd have to split up the repo or similar.
            • You're correct on the intermediate commits; that was my mistake.

            Matt:

            • Yes, it's sampling at 10 Hz - see https://github.com/jhclark/memusg if you're interested (note it looks at resident rather than virtual memory); a simplified sketch of that sampling follows below.
            • I've spent a bit of time already trying to fit in with the 4 GB limit without major changes to our repo or the way we do things here. Before I try more debugging, can you explain a bit more about why you can't just allocate 8 GB - either for us, or for all containers?
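
            For reference, the sampling approach is roughly the following - a simplified sketch of what memusg-style tools do rather than our exact script; it sums resident memory across all processes, which is a fair proxy inside a build container:

            #!bash
            # Simplified sketch of 10 Hz resident-memory sampling (roughly what
            # memusg does); spikes shorter than the 0.1 s interval can be missed.
            make -j6 &
            build_pid=$!
            peak_kb=0
            while kill -0 "$build_pid" 2>/dev/null; do
                # sum VmRSS (in kB) over every process currently visible
                rss_kb=$(cat /proc/[0-9]*/status 2>/dev/null | awk '/^VmRSS/ {sum += $2} END {print sum + 0}')
                [ "$rss_kb" -gt "$peak_kb" ] && peak_kb=$rss_kb
                sleep 0.1
            done
            wait "$build_pid"
            echo "peak resident memory: $((peak_kb / 1024)) MB"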

            Christopher Moore added a comment -

            However, we have recently improved the error handling so that memory errors are now reported in the Pipelines UI.

            How recently was this change made? Last Friday? I think it's what caused my builds to start failing. I had a workaround for OOM in place and working. Now Pipelines considers the build failed after an OOM, even if my script recovers and finishes successfully.

            My workaround is to run fully parallel until the OOM killer stops it, then try again serially:

            #!bash
            set -e
            # first pass: build in parallel until the OOM killer stops it
            make -j4 || true
            # second pass: pick up where it left off and finish serially
            make
            

            Matt Ryall added a comment -

            Hi Nick,

            I can add two points on the specific memory problem you're having.

            First, our handling of builds that exceed the memory limit (failing the build immediately) is unfortunately driven by platform limitations in terms of how Docker works with the Linux kernel. We haven't been able to find a way to provide a softer limit to customers yet. (However, we have recently improved the error handling so that memory errors are now reported in the Pipelines UI.) We'll continue to keep our eyes open for development in this area - please let us know if you have any specific ideas.

            Second, your memory monitoring is probably doing sampling and might not get all the allocations that hit the OS. So there may be peak memory usage higher than what your tool is reporting. We'd love to have better monitoring of build resource usage that we could provide to our customers, but unfortunately developing that feature is quite far down our priorities right now. Likewise for offering an interactive shell. We're working on first enabling Docker build and push, and a few other highly voted features.
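
            One way to sidestep sampling entirely - a sketch only, assuming the cgroup v1 memory controller is visible inside the build container, which may not hold in every environment - is to read the kernel's own peak-usage counter after the build step:

            #!bash
            # Hypothetical sketch: read the cgroup's peak memory counter instead of
            # sampling. Paths differ under cgroup v2 (memory.peak), and the file may
            # not be exposed in every environment.
            make -j3
            if [ -r /sys/fs/cgroup/memory/memory.max_usage_in_bytes ]; then
                peak_bytes=$(cat /sys/fs/cgroup/memory/memory.max_usage_in_bytes)
                echo "cgroup peak memory usage: $((peak_bytes / 1024 / 1024)) MB"
            fi

            That counter uses the same accounting the kernel applies when enforcing the container's memory limit, so it should line up with what actually triggers the OOM kill.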

            If you have the time and impetus to investigate this further yourself, the best replica environment is probably spinning up your build container on Amazon ECS, and seeing if you can get the memory metrics you need there to find a way to get your build consistently below 4GB.

            If we can help you further at all with this, please let us know.

            Regards,
            Matt

            Joshua Tjhin (Inactive) added a comment -

            Hi Nick,

            Thanks for the feedback.

            There appears to be no nice way to maintain build state, short of seeding the container with object files (e.g. checking in build artifacts). This means that even a tiny change results in a 40-60 minute build.

            We have an open feature request to allow dependencies to be cached between pipelines BCLOUD-12818. Could you let me know if such a solution would help?

            Related, if people check in multiple commits we get multiple 40-60 minute builds (is there no build holdoff time as with buildbot? If it doesn't exist already, it might be nice to have the option to only build head and not all intermediate commits)

            This also sounds like BCLOUD-13463.

            Now for your OOM error: unfortunately we currently can't increase memory beyond 4 GB, as this would mean increasing memory for all builds across Pipelines. I'm not familiar enough with -jX and peak memory, so I'll have a colleague add a further reply.

            Regards,
            Joshua

            nick_viv added a comment -

            Edit: Actually I just realised we can presumably manage our own cgroups directly in the image - ignore that part.

              Assignee: Unassigned
              Reporter: nick_viv
              Affected customers: 1
              Watchers: 5