Uploaded image for project: 'Bitbucket Data Center'
  1. Bitbucket Data Center
  2. BSERV-7633

Race condition in content-cache's AbstractCacheEntry

    XMLWordPrintable

Details

    Description

      Symptom:

      Under a stress test, with multiple concurrent Git clones, some fail with the following error:

      fatal: The remote end hung up unexpectedly
      fatal: early EOF
      fatal: index-pack failed
      

      Stash will not log any error, however the atlassian-stash-access.log does show the anomalous clones have finished much easier than the successful clones. For example, some number of clones fail after about 46s:

      0:0:0:0:0:0:0:1 | http | o@UPKJKVx1298x79x26 | - | 2015-06-22 21:39:13,378 | "POST /scm/LT/linux-clone.git/git-upload-pack HTTP/1.1" | "" "git/2.4.3" | cache:hit, clone | 46524 | - |
      0:0:0:0:0:0:0:1 | http | o@UPKJKVx1298x76x26 | - | 2015-06-22 21:39:13,378 | "POST /scm/LT/linux-clone.git/git-upload-pack HTTP/1.1" | "" "git/2.4.3" | cache:hit, clone | 46538 | - |
      0:0:0:0:0:0:0:1 | http | o@UPKJKVx1298x80x26 | - | 2015-06-22 21:39:13,378 | "POST /scm/LT/linux-clone.git/git-upload-pack HTTP/1.1" | "" "git/2.4.3" | cache:hit, clone | 46525 | - |
      0:0:0:0:0:0:0:1 | http | o@UPKJKVx1298x72x26 | - | 2015-06-22 21:39:13,380 | "POST /scm/LT/linux-clone.git/git-upload-pack HTTP/1.1" | "" "git/2.4.3" | cache:hit, clone | 46553 | - |
      0:0:0:0:0:0:0:1 | http | o@UPKJKVx1298x74x24 | - | 2015-06-22 21:39:13,380 | "POST /scm/LT/linux-clone.git/git-upload-pack HTTP/1.1" | "" "git/2.4.3" | cache:hit, clone | 46545 | - |
      0:0:0:0:0:0:0:1 | http | o@UPKJKVx1298x84x29 | - | 2015-06-22 21:39:13,381 | "POST /scm/LT/linux-clone.git/git-upload-pack HTTP/1.1" | "" "git/2.4.3" | cache:hit, clone | 46415 | - |
      0:0:0:0:0:0:0:1 | http | o@UPKJKVx1298x75x25 | - | 2015-06-22 21:39:13,381 | "POST /scm/LT/linux-clone.git/git-upload-pack HTTP/1.1" | "" "git/2.4.3" | cache:hit, clone | 46543 | - |
      0:0:0:0:0:0:0:1 | http | o@UPKJKVx1298x78x26 | - | 2015-06-22 21:39:13,381 | "POST /scm/LT/linux-clone.git/git-upload-pack HTTP/1.1" | "" "git/2.4.3" | cache:hit, clone | 46528 | - |
      0:0:0:0:0:0:0:1 | http | o@UPKJKVx1298x66x29 | - | 2015-06-22 21:39:13,382 | "POST /scm/LT/linux-clone.git/git-upload-pack HTTP/1.1" | "" "git/2.4.3" | cache:hit, clone | 46596 | - |
      0:0:0:0:0:0:0:1 | http | o@UPKJKVx1298x82x29 | - | 2015-06-22 21:39:13,384 | "POST /scm/LT/linux-clone.git/git-upload-pack HTTP/1.1" | "" "git/2.4.3" | cache:hit, clone | 46520 | - |
      0:0:0:0:0:0:0:1 | http | o@UPKJKVx1298x77x26 | - | 2015-06-22 21:39:13,384 | "POST /scm/LT/linux-clone.git/git-upload-pack HTTP/1.1" | "" "git/2.4.3" | cache:hit, clone | 46540 | - |
      

      While the rest complete without error in about 170-190s:

      0:0:0:0:0:0:0:1 | http | o@UPKJKVx1298x64x27 | - | 2015-06-22 21:41:16,833 | "POST /scm/LT/linux-clone.git/git-upload-pack HTTP/1.1" | "" "git/2.4.3" | cache:hit, clone | 170055 | - |
      0:0:0:0:0:0:0:1 | http | o@UPKJKVx1298x65x28 | - | 2015-06-22 21:41:20,840 | "POST /scm/LT/linux-clone.git/git-upload-pack HTTP/1.1" | "" "git/2.4.3" | cache:hit, clone | 174058 | - |
      0:0:0:0:0:0:0:1 | http | o@UPKJKVx1298x83x29 | - | 2015-06-22 21:41:21,013 | "POST /scm/LT/linux-clone.git/git-upload-pack HTTP/1.1" | "" "git/2.4.3" | cache:hit, clone | 174136 | - |
      0:0:0:0:0:0:0:1 | http | o@UPKJKVx1298x70x27 | - | 2015-06-22 21:41:26,177 | "POST /scm/LT/linux-clone.git/git-upload-pack HTTP/1.1" | "" "git/2.4.3" | cache:hit, clone | 179363 | - |
      0:0:0:0:0:0:0:1 | http | o@UPKJKVx1298x69x25 | - | 2015-06-22 21:41:27,322 | "POST /scm/LT/linux-clone.git/git-upload-pack HTTP/1.1" | "" "git/2.4.3" | cache:hit, clone | 180508 | - |
      0:0:0:0:0:0:0:1 | http | o@UPKJKVx1298x56x26 | - | 2015-06-22 21:41:27,688 | "POST /scm/LT/linux-clone.git/git-upload-pack HTTP/1.1" | "" "git/2.4.3" | cache:hit, clone | 180955 | - |
      0:0:0:0:0:0:0:1 | http | o@UPKJKVx1298x44x15 | - | 2015-06-22 21:41:27,706 | "POST /scm/LT/linux-clone.git/git-upload-pack HTTP/1.1" | "" "git/2.4.3" | cache:hit, clone | 180991 | - |
      0:0:0:0:0:0:0:1 | http | o@UPKJKVx1298x61x26 | - | 2015-06-22 21:41:27,873 | "POST /scm/LT/linux-clone.git/git-upload-pack HTTP/1.1" | "" "git/2.4.3" | cache:hit, clone | 181103 | - |
      0:0:0:0:0:0:0:1 | http | o@UPKJKVx1298x67x27 | - | 2015-06-22 21:41:28,784 | "POST /scm/LT/linux-clone.git/git-upload-pack HTTP/1.1" | "" "git/2.4.3" | cache:hit, clone | 181990 | - |
      0:0:0:0:0:0:0:1 | http | o@UPKJKVx1298x63x28 | - | 2015-06-22 21:41:28,893 | "POST /scm/LT/linux-clone.git/git-upload-pack HTTP/1.1" | "" "git/2.4.3" | cache:hit, clone | 182121 | - |
      0:0:0:0:0:0:0:1 | http | o@UPKJKVx1298x60x25 | - | 2015-06-22 21:41:28,949 | "POST /scm/LT/linux-clone.git/git-upload-pack HTTP/1.1" | "" "git/2.4.3" | cache:hit, clone | 182178 | - |
      0:0:0:0:0:0:0:1 | http | o@UPKJKVx1298x68x25 | - | 2015-06-22 21:41:30,360 | "POST /scm/LT/linux-clone.git/git-upload-pack HTTP/1.1" | "" "git/2.4.3" | cache:hit, clone | 183546 | - |
      0:0:0:0:0:0:0:1 | http | o@UPKJKVx1298x47x18 | - | 2015-06-22 21:41:32,004 | "POST /scm/LT/linux-clone.git/git-upload-pack HTTP/1.1" | "" "git/2.4.3" | cache:hit, clone | 185287 | - |
      0:0:0:0:0:0:0:1 | http | o@UPKJKVx1298x62x27 | - | 2015-06-22 21:41:32,854 | "POST /scm/LT/linux-clone.git/git-upload-pack HTTP/1.1" | "" "git/2.4.3" | cache:hit, clone | 186082 | - |
      0:0:0:0:0:0:0:1 | http | o@UPKJKVx1298x59x26 | - | 2015-06-22 21:41:34,233 | "POST /scm/LT/linux-clone.git/git-upload-pack HTTP/1.1" | "" "git/2.4.3" | cache:hit, clone | 187480 | - |
      0:0:0:0:0:0:0:1 | http | o@UPKJKVx1298x73x27 | - | 2015-06-22 21:41:37,935 | "POST /scm/LT/linux-clone.git/git-upload-pack HTTP/1.1" | "" "git/2.4.3" | cache:hit, clone | 191104 | - |
      0:0:0:0:0:0:0:1 | http | o@UPKJKVx1298x35x6 | - | 2015-06-22 21:41:38,077 | "POST /scm/LT/linux-clone.git/git-upload-pack HTTP/1.1" | "" "git/2.4.3" | cache:miss, clone | 191373 | - |
      0:0:0:0:0:0:0:1 | http | o@UPKJKVx1298x71x28 | - | 2015-06-22 21:41:38,214 | "POST /scm/LT/linux-clone.git/git-upload-pack HTTP/1.1" | "" "git/2.4.3" | cache:hit, clone | 191398 | - |
      0:0:0:0:0:0:0:1 | http | o@UPKJKVx1298x81x28 | - | 2015-06-22 21:41:38,829 | "POST /scm/LT/linux-clone.git/git-upload-pack HTTP/1.1" | "" "git/2.4.3" | cache:hit, clone | 191968 | - |
      

      Reproducing the Problem:

      1) Create a repository with a clone of the Linux kernel

      2) Restart Stash (The problem occurs when building the SCM upload-pack cache file. If the below clones all experience a cache hit this problem does not occur).

      3) Execute, locally on the same server as Stash, 20 concurrent git clones of the repository. These should be executed at the same instant so a script like this works best:

      #!/bin/bash
      for i in $(seq 1 20)
      do
          git clone http://localhost:7990/scm/project/linux-clone.git linux$i > log$i 2>&1 &
      done
      

      Repeat steps 2 & 3 some number of times to reproduce the problem. In my test environment I am able to reproduce it approximately 1 in 5 executions of the test.

      This problem was reproduced in the same environment described in STASH-7518 and it is necessary to have the fix for STASH-7518 applied otherwise it is much more likely you will run into that problem instead.

      It should be noted that the symptoms of this bug and STASH-7518 are quite different:

      • STASH-7518 - Some Git clones hang and the Stash access logs show the operation is completing in ~200ms
      • This bug - Some Git clones abort with an "early EOF" error and the Stash access logs show the operations is completing in many tens of seconds (typically 40-90s depending on the performance of the platform).

      Root Cause:
      In the above test, 1 clone (actually the server side thread handling the clone) will become responsible for building the upload-pack and creating the cache file. The other 19 clones will be serviced from threads reading from this cache file.

      The 19 reader threads are able to read the cache file while the writer thread is still writing it. These 19 readers will block until the first byte is read and then start reading. A race occurs between the writer and readers, and the problem described here occurs when the reader threads read to the end of the file before it is completed writing. In this case read() returns -1; this is interpreted as EOF, the data is written to the output stream and the connection closed.

      I have observed that typically the writer thread grows the file faster than the readers can consume it. However this is not always true as the first few hundred bytes of the file are written. As a result, we can observe that this failure coincides roughly with the first byte being written to the cache file.

      AbstractCacheEntry$SyncingInputStream.shouldWaitForData() attempts to prevent this problem from happening, and this check, bytesRead == bytesWritten, is aimed at ensuring we don't read when there is no data to be read. However bytesWritten merely indicates how many bytes have been written to the OutputStream. But in practice that isn't translating into a guarantee that the data is available in the file to the reader.

      Attachments

        Issue Links

          Activity

            People

              behumphreys Ben Humphreys
              behumphreys Ben Humphreys
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: