  1. Jira Data Center
  2. JRASERVER-23311

Attachment encoding protection still insufficient


      Once upon a time we tried to make filename encoding for attachments more robust. JRA-17805 was where that was tracked. (Pete did some stealth work on it; despite not appearing anywhere on the issue, he did the work. Look at the source tab.) After looking at a support case today and investigating the insanity that is the Unicode spec some more, I think that the current implementation is insufficiently robust.

      Right now we do something like:

      File[] files = attachmentDir.listFiles();   // attachmentDir: a java.io.File pointing at the attachment directory
      new FileInputStream(files[0]);

      It might seem surprising, but this can still throw a FileNotFoundException. What can happen is this:

      1. Use Linux.
      2. Set the user's encoding to ISO-8859-1.
      3. Create a file named ä.txt. This will be saved on disk as a sequence of bytes that includes the byte 228 (0xE4), since that's what a-umlaut is in ISO-8859-1.
      4. Switch encoding to UTF-8.
      5. Run above snippet.
      6. listFiles will see that 228 and try to interpret it as UTF-8. It's not valid UTF-8: bytes of 128 and above can only appear as part of a multi-byte sequence, and 228 (0xE4) is a lead byte that must be followed by two continuation bytes, not by the "." that comes next. So it gets replaced with EF BF BD (the UTF-8 encoding of U+FFFD, the Unicode replacement character). So your list will contain the file <EF><BF><BD>.txt, which usually just gets displayed as ?.txt
      7. When you try to open the file <EF><BF><BD>.txt the filesystem will naturally complain that no such file exists, because there is nothing with that sequence of bytes as its name.
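
      The byte-level part of the steps above can be reproduced without touching a filesystem. The sketch below is only an approximation: it decodes the raw name bytes with Java's UTF-8 charset rather than going through File.listFiles(), and the class and method names (FilenameRoundTrip, decodeAsUtf8, survivesRoundTrip) are made up for illustration, not taken from JIRA.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of JIRA.
final class FilenameRoundTrip {

    // Decode raw on-disk name bytes as UTF-8. Malformed input is replaced
    // with U+FFFD, which is what new String(byte[], Charset) always does.
    static String decodeAsUtf8(byte[] rawName) {
        return new String(rawName, StandardCharsets.UTF_8);
    }

    // True if re-encoding the decoded name gives back the original bytes,
    // i.e. the name we would hand to the filesystem matches what is on disk.
    static boolean survivesRoundTrip(byte[] rawName) {
        byte[] reEncoded = decodeAsUtf8(rawName).getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(rawName, reEncoded);
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // "ä.txt" as written under an ISO-8859-1 locale: E4 2E 74 78 74
        byte[] onDisk = "\u00e4.txt".getBytes(StandardCharsets.ISO_8859_1);
        byte[] reEncoded = decodeAsUtf8(onDisk).getBytes(StandardCharsets.UTF_8);

        System.out.println("on disk:    " + hex(onDisk));
        System.out.println("re-encoded: " + hex(reEncoded));
        System.out.println("round trip ok: " + survivesRoundTrip(onDisk));
    }
}
```

      The re-encoded name starts with EF BF BD instead of E4, so any attempt to open the file by that name is looking for bytes that are not on disk.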

      I think the only way to be safe is to remove all "unsafe" characters from the attachment name. There are two possible options here:

      1. Leave ASCII characters as-is and replace the non-ASCII stuff with some kind of escaping mechanism. Instead of "Trîcky Nåme.txt" it might be stored on disk as "Tr_cky N_me.txt" or similar. That seems like a decent compromise for most European languages – they still have some semblance of the original name available. But it screws over the rest of the world. 天才。txt would just become __.txt, which is pretty useless. So I recommend we just do....
      2. Drop all pretense of keeping the original filename around. Just use the plain ID. Digits 0-9. Never have to worry about encoding issues again. The downside: people who want to access the filesystem and browse attachments that way now need to go to the database to map IDs back to the original filenames. I don't think that is nearly a common enough use case to worry about.
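
      To make the trade-off between the two options concrete, here is a minimal sketch of both. The class and method names (AttachmentNames, asciiOnly, idOnly) are hypothetical, the choice of "_" as the replacement is just the one used in the example above, and the ID is assumed to be a non-negative number.

```java
// Hypothetical helper; not an existing JIRA class.
final class AttachmentNames {

    // Option 1: keep printable ASCII as-is, replace every other code point
    // with a single '_'. Iterates by code point so a character outside the
    // BMP becomes one underscore, not two.
    static String asciiOnly(String original) {
        StringBuilder sb = new StringBuilder(original.length());
        for (int i = 0; i < original.length(); ) {
            int cp = original.codePointAt(i);
            sb.append(cp >= 0x20 && cp < 0x7F ? (char) cp : '_');
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    // Option 2: the on-disk name is just the attachment ID in decimal.
    // The original filename would live only in the database.
    static String idOnly(long attachmentId) {
        return Long.toString(attachmentId);
    }

    public static void main(String[] args) {
        System.out.println(asciiOnly("Tr\u00eecky N\u00e5me.txt"));   // Tr_cky N_me.txt
        System.out.println(asciiOnly("\u5929\u624d\u3002txt"));       // ___txt
        System.out.println(idOnly(10042L));                           // 10042
    }
}
```

      Note that in the second example the ideographic full stop is itself non-ASCII, so even the extension separator is lost under option 1.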

      Note that there are also possibly some Unicode normalization issues that we are currently open to but I still haven't investigated them enough to say for certain their impact on JIRA.
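
      For reference, this is the kind of thing normalization means here: the same visible name can be stored as different byte sequences depending on whether it is in composed (NFC) or decomposed (NFD) form. The sketch below only shows that the bytes differ; it makes no claim about how any particular filesystem or JIRA handles it, and NormalizationDemo is a made-up name.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.Arrays;

// Hypothetical demo class; not part of JIRA.
final class NormalizationDemo {

    // UTF-8 bytes of the name after normalizing it to the given form.
    static byte[] utf8(String name, Normalizer.Form form) {
        return Normalizer.normalize(name, form).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "\u00e4.txt"; // "ä.txt"

        byte[] composed = utf8(name, Normalizer.Form.NFC);   // ä as one code point, U+00E4
        byte[] decomposed = utf8(name, Normalizer.Form.NFD); // a followed by U+0308

        System.out.println("NFC length: " + composed.length);
        System.out.println("NFD length: " + decomposed.length);
        System.out.println("same bytes: " + Arrays.equals(composed, decomposed));
    }
}
```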

            pleschev Peter Leschev
            jpendleton Justus Pendleton (Inactive)