Runner does not accept steps while still showing as online after JVM OOM

    • 5
    • Severity 3 - Minor
    • 5,346

      Issue Summary

      This is reproducible on Data Center: (no)

      We observed the runner throwing an error with the following message:
      Uncaught error from RxJava

      ....

      Wrapped by: io.reactivex.exceptions.UndeliverableException: The exception could not be delivered to the consumer because it has already canceled/disposed the flow or the exception has nowhere to go to begin with. Further reading: https://github.com/ReactiveX/RxJava/wiki/What's-different-in-2.0#error-handling | {actual exception}
      After that, the runner keeps posting "ONLINE" messages without receiving any steps.

      However, on the pipeline side, steps keep being scheduled to these runners because they keep sending the ONLINE state, and those steps/pipelines end up failing with a timeout.

      Steps to Reproduce

      This happens randomly and is hard to reproduce, but we most often see it when the host could not allocate enough memory for the runner container (especially in Kubernetes environments).

      Expected Results

      The runner should handle a JVM OOM gracefully (either terminate itself or recover from the OOM). The step that caused the OOM should show a proper error message stating that it failed because of an OutOfMemoryError, instead of failing with a timeout.
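
One way the "terminate itself" option could look is sketched below. It is a minimal illustration, not the runner's actual code: the class `FatalErrorDetector` and its methods are hypothetical names. It assumes the `{actual exception}` in the log above is reachable through the cause chain of the `UndeliverableException`, and that such a handler would be registered through something like RxJava 2's `RxJavaPlugins.setErrorHandler` or `Thread.setDefaultUncaughtExceptionHandler`.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

/** Hypothetical helper for illustration; not part of the runner's code base. */
final class FatalErrorDetector {
    private FatalErrorDetector() {}

    /** Returns true if t, or any throwable in its cause chain, is an OutOfMemoryError. */
    static boolean isCausedByOutOfMemory(Throwable t) {
        // Track visited throwables so a cyclic cause chain cannot loop forever.
        Set<Throwable> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        for (Throwable cur = t; cur != null && seen.add(cur); cur = cur.getCause()) {
            if (cur instanceof OutOfMemoryError) {
                return true;
            }
        }
        return false;
    }

    /** Assumed policy: stop the process on OOM so it can no longer report ONLINE. */
    static void handleUncaught(Throwable t) {
        if (isCausedByOutOfMemory(t)) {
            // halt() skips shutdown hooks, which may themselves fail when memory is exhausted.
            Runtime.getRuntime().halt(1);
        }
    }
}
```

With a handler like this, a supervisor (for example the container's restart policy) would be expected to restart the runner. HotSpot's `-XX:+ExitOnOutOfMemoryError` flag may achieve a similar effect without code changes; whether either approach suits the runner is not confirmed by this report.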

      Actual Results

      The runner keeps sending the ONLINE state back to the server but does not accept any steps scheduled to it, so all steps assigned to it fail with a timeout.
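
The mismatch described here (heartbeat says ONLINE, step intake is dead) could in principle be avoided by deriving the reported state from a shared health flag. The sketch below is an assumption-laden illustration: `RunnerHealth`, the `UNHEALTHY` state, and all method names are hypothetical, and only `ONLINE` comes from this report.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch; every name except the ONLINE state is an assumption. */
final class RunnerHealth {
    enum State { ONLINE, UNHEALTHY }

    // Holds the first fatal error seen, or null while the runner is healthy.
    private final AtomicReference<Throwable> fatal = new AtomicReference<>();

    /** Records the first fatal error; later calls keep the original one. */
    void markFatal(Throwable t) {
        fatal.compareAndSet(null, t);
    }

    /** State the heartbeat would report: ONLINE only while no fatal error is recorded. */
    State heartbeatState() {
        return fatal.get() == null ? State.ONLINE : State.UNHEALTHY;
    }

    /** Reason that could be attached to a failed step, or null if healthy. */
    String failureReason() {
        Throwable t = fatal.get();
        return t == null ? null : t.getClass().getSimpleName() + ": " + t.getMessage();
    }
}
```

In this sketch the error handler would call `markFatal`, and the heartbeat loop would send `heartbeatState()` rather than a constant, so the scheduler stops assigning steps and a failed step can carry the OOM reason instead of a timeout.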

      Workaround

      Currently, we have only noticed this happening after an OOM. The workaround is to allocate enough memory to the runner container as well as to the host.

      We suggest giving the runner container at least 512MB of memory to avoid OOM.

            Assignee:
            Unassigned
            Reporter:
            lliang2
            Votes:
            1
            Watchers:
            6