Confluence Data Center / CONFSERVER-31883

Attachments with accented filenames are duplicated on the server with different encodings

Details

    Description

      Summary

      If I take a single file and drag-and-drop it into the Attachments macro, and then take the very same file and click the "Browse for files" link to upload it again to Confluence, I end up with two files on the page with what seems to be the same name.

      Environment

      This occurs with WebKit browsers (tested with Chrome 31.0.1650.57 and Safari 7, both on OS X 10.9). This does not appear to be a problem with Firefox.

      Steps to Reproduce

      1. Download the attached file "Les Misérables.txt".
      2. Since this is an encoding issue, verify that your browser hasn't altered the encoding of the filename as uploaded to (or downloaded from) JAC. For example:
      # cd ~/Downloads
      # ls "Les Misérables.txt" | od -t x1
      

      The results should show this:

      0000000    4c  65  73  20  4d  69  73  65  cc  81  72  61  62  6c  65  73
      0000020    2e  74  78  74  0a                                            
      0000025
      

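      Note in particular the "65 cc 81": this is the letter "e" (65) followed by the combining acute accent U+0301 (cc 81). If the output shows "c3 a9" instead (the single precomposed character U+00E9), then the encoding was changed somewhere along the way.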
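
      As a cross-check, the two byte sequences can be decoded and inspected code point by code point. The following is a minimal sketch in Python, not part of the original report; the helper name `describe` is hypothetical.

```python
import unicodedata

def describe(hex_bytes: str) -> list:
    """Decode space-separated hex bytes as UTF-8 and name each code point."""
    text = bytes.fromhex(hex_bytes).decode("utf-8")
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

# Decomposed form: a base letter followed by a combining mark.
decomposed = describe("65 cc 81")
# Precomposed form: a single code point.
precomposed = describe("c3 a9")

print(decomposed)
print(precomposed)
```

      The first list contains two code points and the second contains one, which is why the two names differ in length by one byte.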

      If you can't download this file successfully with the right name, you can also create one from a Mac in Terminal.app by typing:

      #cat >"Les Misérables.txt"
      foo^D

      To enter the accented "e", type "<option-e>e" on any Mac that is using the standard US keyboard layout.

      1. Go to a page containing the Attachments macro. Click the "Browse for files" link and select the file you just downloaded or created to upload it.
      2. On the resulting page, copy the filename from the attachments macro to the clipboard with command-C.
      3. Open Terminal.app, type "pbpaste | od -t x1" and you should see:
        0000000    4c  65  73  20  4d  69  73  c3  a9  72  61  62  6c  65  73  2e
        0000020    74  78  74                                                    
        0000023
        

      Notice that the "65 cc 81" sequence has disappeared; the name now contains "c3 a9" instead.

      1. Go back to Confluence and drag and drop the same file into the same attachments container.
      2. Notice that there are now two copies of the file on the same page! See screenshot.
      3. Copy the filename of the most recently uploaded file to the clipboard.
      4. In Terminal.app, type "pbpaste | od -t x1" again and notice that this filename is one byte longer (20 bytes instead of 19):
        0000000    4c  65  73  20  4d  69  73  65  cc  81  72  61  62  6c  65  73
        0000020    2e  74  78  74                                                
        0000024
        

      In this case, you will see that the correct "65 cc 81" is still there.

      Expected Results

      Confluence should recognize that the two names are logically the same filename and treat the second upload as the same attachment.
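
      One way to express "logically the same filename" is to compare names after Unicode normalization. The following is a minimal sketch in Python, assuming NFC as the canonical form; the function `same_filename` is hypothetical and is not a Confluence API.

```python
import unicodedata

def same_filename(a: str, b: str, form: str = "NFC") -> bool:
    """Return True if two names are equal after Unicode normalization."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# The two names from the steps above, rebuilt from their byte dumps.
browsed = bytes.fromhex(
    "4c 65 73 20 4d 69 73 c3 a9 72 61 62 6c 65 73 2e 74 78 74"
).decode("utf-8")
dropped = bytes.fromhex(
    "4c 65 73 20 4d 69 73 65 cc 81 72 61 62 6c 65 73 2e 74 78 74"
).decode("utf-8")

print(browsed == dropped)               # raw comparison
print(same_filename(browsed, dropped))  # normalized comparison
```

      The raw comparison treats the names as different, while the normalized comparison treats them as equal.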

      Actual Results

      The two byte sequences are treated as different filenames, so the same file appears twice on the page.

      Notes

      The underlying issue is that when a file is uploaded through the "Browse for files" link in the Attachments macro, something converts the base letter and its Unicode combining mark into a single precomposed code point (the effect of Unicode NFC normalization). The resulting filename looks identical but differs byte for byte from the one stored by a drag-and-drop upload.
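
      The effect described here can be reproduced in isolation. The sketch below is a simulation in Python, not Confluence code: it assumes, hypothetically, that one upload path normalizes the name to NFC while the other keeps the name as received. Which component actually performs the conversion is not established in this report, and both function names are invented for illustration.

```python
import unicodedata

# Name as listed on the Mac: "e" followed by U+0301 (decomposed).
original = "Les Mise\u0301rables.txt"

def upload_via_browse(name: str) -> str:
    """Hypothetical path that composes base letter + combining mark (NFC)."""
    return unicodedata.normalize("NFC", name)

def upload_via_drag_and_drop(name: str) -> str:
    """Hypothetical path that keeps the name exactly as received."""
    return name

# A store keyed on the raw name sees two distinct attachments.
raw_store = {upload_via_browse(original), upload_via_drag_and_drop(original)}

# A store keyed on the normalized name sees one.
normalized_store = {unicodedata.normalize("NFC", n) for n in raw_store}

print(len(raw_store), len(normalized_store))
```

      Keying on a normalized name is only one possible design; normalizing at upload time on every path would have the same effect.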

            People

              Assignee: Unassigned
              Reporter: Scott Dudley [Inactive]
              Votes: 5
              Watchers: 7