Race condition during child Git process cleanup results in the process ending up as a defunct/zombie process


    • 10
    • Severity 3 - Minor
    • 5

      Issue Summary

      Bitbucket uses NuProcess for external process execution.

      A child Git process may not be reaped properly and can end up as a defunct (zombie) process because of a race condition in the following scenario:

      The child Git process is timed out from a separate thread while the NuProcess thread is still waiting for the Git process to signal that it is done.

      The NuProcess thread is waiting for the tracked process in the deadpool list in com.zaxxer.nuprocess.linux.ProcessEpoll to change state.

      If the waiting thread is interrupted (e.g. due to a timeout), it drops the processes remaining in the deadpool and returns without reaping them, leaving them as zombies.

      Sample data

      Git zombie process - PID 46341

      atlbitb+  46341  57558  57517  0.0  0.0      0     0 ?        Z    Wed May 18 00:20:21 2022 00:00:00 [git] <defunct>
      

       

      Logs with TRACE and DEBUG logging enabled for the com.atlassian.bitbucket.dmz.process.NioProcess and com.zaxxer.nuprocess.linux.ProcessEpoll loggers:

       

      PID: 46341, request id: *78LZ5Xx20x110740121x34

      2022-05-18 00:20:22,143 TRACE [threadpool:thread-2] USER1 *78LZ5Xx20x110740121x34 qtv9f8 137.201.17.50,0:0:0:0:0:0:0:1 "GET /rest/api/latest/projects/PROJ1/repos/repo1/branches HTTP/1.0" c.a.bitbucket.dmz.process.NioProcess 46341: [/usr/bin/git rev-list --format=%H%x02%P%x02%aN%x02%aE%x02%at%x02%cN%x02%cE%x02%ct -21 --no-min-parents --stdin --no-walk=unsorted --] started (cwd: /var/atlassian/application-data/bitbucket/shared/data/repositories/101)
      2022-05-18 00:20:24,138 INFO  [http-nio-7990-exec-21] USER1 *78LZ5Xx20x110740121x34 qtv9f8 137.201.17.50,0:0:0:0:0:0:0:1 "GET /rest/api/latest/projects/PROJ1/repos/repo1/branches HTTP/1.0" c.a.s.i.r.PluginRefMetadataMapProvider Timed out when retrieving ref metadata for com.atlassian.bitbucket.server.bitbucket-branch:latest-commit-metadata
      2022-05-18 00:20:25,194 DEBUG [threadpool:thread-2] USER1 *78LZ5Xx20x110740121x34 qtv9f8 137.201.17.50,0:0:0:0:0:0:0:1 "GET /rest/api/latest/projects/PROJ1/repos/repo1/branches HTTP/1.0" c.z.nuprocess.linux.ProcessEpoll 46341: Added to deadpool
      2022-05-18 00:20:25,194 DEBUG [threadpool:thread-2] USER1 *78LZ5Xx20x110740121x34 qtv9f8 137.201.17.50,0:0:0:0:0:0:0:1 "GET /rest/api/latest/projects/PROJ1/repos/repo1/branches HTTP/1.0" c.z.nuprocess.linux.ProcessEpoll No processes left to pump
      2022-05-18 00:20:25,195 DEBUG [threadpool:thread-2] USER1 *78LZ5Xx20x110740121x34 qtv9f8 137.201.17.50,0:0:0:0:0:0:0:1 "GET /rest/api/latest/projects/PROJ1/repos/repo1/branches HTTP/1.0" c.z.nuprocess.linux.ProcessEpoll Interrupted with 1 processes still in the deadpool
      

       

      On the threadpool:thread-2 thread, NuProcess added the Git process to the deadpool list (the list of processes that are dead but not yet reaped) and waited for the process to reach its final state.

      While it was waiting, the timeout on the http-nio-7990-exec-21 thread occurred and interrupted the wait. At that point NuProcess dropped the process and stopped waiting for it, leaving it as a zombie.
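
      The interruption pattern described above can be illustrated with a small, self-contained Java model. This is only a sketch and not NuProcess code: DeadpoolRaceSketch, reap and simulate are hypothetical names, the deadpool is modelled as a plain set of PIDs, and the timeout from the other thread is modelled by setting the interrupt flag before the wait starts. With keepWaitingOnInterrupt set to false the loop returns on interrupt and leaves the entry behind, which mirrors the "Interrupted with 1 processes still in the deadpool" log line. With true it remembers the interrupt, finishes draining and then restores the flag. That second variant is one common way to handle an interrupt during cleanup, not necessarily how this issue is or will be fixed.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Toy model only: hypothetical names, not the NuProcess implementation.
final class DeadpoolRaceSketch {

    /**
     * Waits for exit events of the PIDs in the deadpool and removes each PID once
     * its event arrives. Returns the number of PIDs left unreaped.
     */
    static int reap(BlockingQueue<Integer> exitEvents,
                    Set<Integer> deadpool,
                    boolean keepWaitingOnInterrupt) {
        boolean interrupted = false;
        try {
            while (!deadpool.isEmpty()) {
                try {
                    Integer pid = exitEvents.poll(1, TimeUnit.SECONDS);
                    if (pid == null) {
                        break; // no exit event within the wait window
                    }
                    deadpool.remove(pid); // stands in for reaping the child
                } catch (InterruptedException e) {
                    if (!keepWaitingOnInterrupt) {
                        // Problematic variant: give up and abandon the deadpool.
                        return deadpool.size();
                    }
                    // Alternative variant: remember the interrupt, keep draining.
                    interrupted = true;
                }
            }
        } finally {
            if (interrupted) {
                Thread.currentThread().interrupt(); // restore for the caller
            }
        }
        return deadpool.size();
    }

    /** Runs one scenario and returns how many processes were left in the deadpool. */
    static int simulate(boolean keepWaitingOnInterrupt) {
        BlockingQueue<Integer> exitEvents = new LinkedBlockingQueue<>();
        Set<Integer> deadpool = new HashSet<>();
        deadpool.add(46341);     // PID taken from the sample above
        exitEvents.add(46341);   // the exit event is available in this model
        Thread.currentThread().interrupt(); // models the timeout from another thread
        try {
            return reap(exitEvents, deadpool, keepWaitingOnInterrupt);
        } finally {
            Thread.interrupted(); // clear the flag so the caller is unaffected
        }
    }

    public static void main(String[] args) {
        System.out.println("abandon on interrupt, left in deadpool: " + simulate(false));
        System.out.println("keep waiting on interrupt, left in deadpool: " + simulate(true));
    }
}
```

      In this model the interrupted wait throws before the exit event is consumed, so the first variant returns with one PID still tracked while the second returns with none.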

      Steps to Reproduce

      N/A

      Expected Results

      Child Git processes are properly reaped.

      Actual Results

      Defunct/zombie Git processes are observed.
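
      To check whether an instance is affected, defunct Git processes can be listed by reading /proc on Linux. The sketch below is illustrative only: ZombieScanSketch and its methods are hypothetical names, and it relies on the documented layout of /proc/<pid>/stat, where the command name appears in parentheses and is followed by a one-character state (Z for zombie). It does not check the parent PID, so results should still be compared against the Bitbucket process.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Illustrative helper with hypothetical names; Linux only.
final class ZombieScanSketch {

    /** Returns true if a /proc/<pid>/stat line reports state Z (zombie). */
    static boolean isZombie(String statLine) {
        // The command name may contain spaces or parentheses, so search from the end.
        int close = statLine.lastIndexOf(')');
        if (close < 0 || close + 2 >= statLine.length()) {
            return false;
        }
        return statLine.charAt(close + 2) == 'Z';
    }

    /** Extracts the command name between the first '(' and the last ')'. */
    static String command(String statLine) {
        int open = statLine.indexOf('(');
        int close = statLine.lastIndexOf(')');
        return (open >= 0 && close > open) ? statLine.substring(open + 1, close) : "";
    }

    /** Scans /proc and returns the PIDs of zombie processes whose command is "git". */
    static List<String> zombieGitPids() throws IOException {
        List<String> pids = new ArrayList<>();
        try (DirectoryStream<Path> dirs =
                 Files.newDirectoryStream(Paths.get("/proc"), "[0-9]*")) {
            for (Path dir : dirs) {
                try {
                    String stat = new String(Files.readAllBytes(dir.resolve("stat")));
                    if (isZombie(stat) && command(stat).equals("git")) {
                        pids.add(dir.getFileName().toString());
                    }
                } catch (IOException e) {
                    // The process disappeared while scanning; skip it.
                }
            }
        }
        return pids;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Zombie git PIDs: " + zombieGitPids());
    }
}
```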

      Workaround

      • Terminate the parent process: restarting the Bitbucket Server instance cleans up the zombie processes.
      • Adjust the timeout value, after confirming from the trace/debug logs that a timeout occurred and interrupted the cleanup.
        For the specific sample above, where the timeout occurred while retrieving ref metadata, the ref.metadata.timeout value can be raised (e.g. from 2 to 3 seconds).
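
      As a sketch of the second workaround, the property could be set as follows. The file location shown in the comment is an assumption (the usual shared configuration file) and should be checked against the instance's setup. Changes to this file typically take effect only after Bitbucket is restarted.

```properties
# Assumed location: $BITBUCKET_HOME/shared/bitbucket.properties
# Timeout in seconds for retrieving ref metadata, raised from 2 to 3
ref.metadata.timeout=3
```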
         

       

            Assignee:
            Unassigned
            Reporter:
            JP Mariano
            Votes:
            9
            Watchers:
            14
