- Type: Bug
- Resolution: Fixed
- Priority: High
- Component/s: Pipelines - Runners
- Severity 3 - Minor
Issue Summary
This is reproducible on Data Center: (no)
We observed the runner throwing an error with the following message:
Uncaught error from RxJava
....
Wrapped by: io.reactivex.exceptions.UndeliverableException: The exception could not be delivered to the consumer because it has already canceled/disposed the flow or the exception has nowhere to go to begin with. Further reading: https://github.com/ReactiveX/RxJava/wiki/What's-different-in-2.0#error-handling | {actual exception}
After that, the runner kept posting the "ONLINE" state without receiving any steps.
However, on the Pipelines side, steps continue to be scheduled to these runners because they keep reporting the ONLINE state, and those steps/pipelines end up failing with a timeout.
Steps to Reproduce
This happens randomly and is hard to reproduce, but we most often see it when the host did not manage to allocate enough memory for the runner container (especially in Kubernetes environments).
Expected Results
The runner should handle a JVM OOM gracefully (either terminate itself or recover from the OOM). The step that caused the OOM should show a proper error message stating that it failed because of an OutOfMemory error, instead of failing with a timeout.
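As an illustration of the "terminate itself" option, here is a minimal JDK-only sketch. It is not the runner's actual code. The class name `OomGuard`, the method names and the exit code are hypothetical. It walks the cause chain of an uncaught error (such as the wrapped exception in the log above) and halts the JVM if an OutOfMemoryError is found, so the process stops reporting ONLINE and can be restarted by its supervisor.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

// Hypothetical helper, not part of the runner: detects OOM in a wrapped error.
final class OomGuard {
    // Assumed exit code; any non-zero value a supervisor can recognise would do.
    static final int EXIT_CODE_OOM = 3;

    private OomGuard() {}

    /** Returns true if t, or any throwable in its cause chain, is an OutOfMemoryError. */
    static boolean causedByOutOfMemory(Throwable t) {
        // The identity set stops the walk if the cause chain is cyclic.
        Set<Throwable> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        for (Throwable cur = t; cur != null && seen.add(cur); cur = cur.getCause()) {
            if (cur instanceof OutOfMemoryError) {
                return true;
            }
        }
        return false;
    }

    /** Installs a default handler that halts the JVM when an uncaught error stems from OOM. */
    static void install() {
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            if (causedByOutOfMemory(error)) {
                // halt() skips shutdown hooks, which may themselves need memory to run.
                Runtime.getRuntime().halt(EXIT_CODE_OOM);
            }
            System.err.println("Uncaught error on " + thread.getName() + ": " + error);
        });
    }
}
```

In an RxJava 2 application the same check could plausibly be registered through `RxJavaPlugins.setErrorHandler`, since undeliverable errors are routed there. The HotSpot flag `-XX:+ExitOnOutOfMemoryError` is another way to get a similar effect without code changes. Whether either fits the runner is an assumption.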
Actual Results
The runner keeps sending the ONLINE state back to the server but does not accept any step scheduled to it. As a result, all steps assigned to it fail with a timeout.
Workaround
So far we have only seen this happen on OOM. The workaround is to allocate enough memory to the runner container as well as to the host.
We suggest giving the runner container at least 512MB of memory to avoid OOM.