Once upon a time we tried to make filename encoding for attachments more robust.
JRA-17805 was where that was tracked. (Pete did some stealth work on it: he doesn't appear anywhere on the issue, but he did the work; look at the source tab.) After looking at a support case today and doing more investigating of the insanity that is the Unicode spec, I think that the current implementation is insufficiently robust.
Right now we do something like:
It might seem surprising but this can still throw a FileNotFoundException. What can happen is this:
- Use Linux.
- Set the user's encoding to ISO-8859-1.
- Create the file ä.txt. This will be saved on disk as a sequence of bytes that includes the byte 228 (0xE4), since that's what a-umlaut is in ISO-8859-1.
- Switch encoding to UTF-8.
- Run the above snippet.
- listFiles will see that 228 (0xE4) and try to decode it as UTF-8. It's not valid UTF-8: bytes of 128 and above can only appear as part of a multi-byte sequence, and 0xE4 is a lead byte that has to be followed by two continuation bytes, not by the "." that actually follows it. So it gets replaced with EF BF BD (the UTF-8 encoding of U+FFFD, the Unicode replacement character). So your list will contain the file <EF><BF><BD>.txt, which usually just gets displayed as ?.txt
- When you try to open the file <EF><BF><BD>.txt the filesystem will naturally complain that no such file exists, because there is nothing with that sequence of bytes as its name.
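The byte-level part of the steps above can be reproduced without touching a filesystem. This is only a sketch: it assumes the JVM's filename decoding behaves like new String(bytes, UTF_8) under a UTF-8 locale, and the class and method names (ReplacementCharDemo, roundTripAsUtf8, hex) are made up for illustration, not taken from the JIRA code.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementCharDemo {

    // Decode raw filename bytes as UTF-8, then encode the result back to bytes.
    // The String constructor silently replaces malformed input with U+FFFD.
    public static byte[] roundTripAsUtf8(byte[] nameOnDisk) {
        String nameSeenByJava = new String(nameOnDisk, StandardCharsets.UTF_8);
        return nameSeenByJava.getBytes(StandardCharsets.UTF_8);
    }

    // Render bytes as space-separated uppercase hex, e.g. "E4 2E".
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "\u00E4.txt" is ä.txt; in ISO-8859-1 the ä is the single byte 0xE4 (228).
        byte[] nameOnDisk = "\u00E4.txt".getBytes(StandardCharsets.ISO_8859_1);
        // What we would hand back to the filesystem after a UTF-8 decode/encode.
        byte[] nameWeAskFor = roundTripAsUtf8(nameOnDisk);

        System.out.println("on disk:    " + hex(nameOnDisk));   // E4 2E 74 78 74
        System.out.println("we ask for: " + hex(nameWeAskFor)); // EF BF BD 2E 74 78 74
    }
}
```

The two byte sequences differ, which is why opening the name that came back from the listing can fail even though the file is sitting right there.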
I think the only way to be safe is to remove all "unsafe" characters from the attachment name. There are two possible options here:
- Leave ASCII characters as-is and replace the non-ASCII stuff with some kind of escaping mechanism. Instead of "Trîcky Nåme.txt" it might be stored on disk as "Tr_cky N_me.txt" or similar. That seems like a decent compromise for most European languages – they still have some semblance of the original name available. But it screws over the rest of the world. 天才。txt would just become __.txt, which is pretty useless. So I recommend we just do....
- Drop all pretense of keeping the original filename around. Just use the plain ID. Digits 0-9. Never have to worry about encoding issues again. The downside: people who want to access the filesystem and browse attachments that way now need to go to the database to work out which file is which. I don't think that is nearly a common enough use case to worry about.
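To make the two options concrete, here is a rough sketch of what each naming scheme might look like. The class and method names (AttachmentNames, escapeNonAscii, idBasedName) and the choice of "_" as the escape character are assumptions for illustration only; nothing here is the existing JIRA code.

```java
public class AttachmentNames {

    // Option 1: keep ASCII as-is and replace every non-ASCII code point with '_'.
    // Working on code points means a surrogate pair becomes a single '_'.
    // Note: problematic ASCII characters such as '/' are left alone in this sketch.
    public static String escapeNonAscii(String originalName) {
        StringBuilder sb = new StringBuilder();
        originalName.codePoints()
                .forEach(cp -> sb.append(cp < 0x80 ? (char) cp : '_'));
        return sb.toString();
    }

    // Option 2: ignore the original name entirely and use only the numeric ID,
    // so the on-disk name never contains anything but the digits 0-9.
    public static String idBasedName(long attachmentId) {
        return Long.toString(attachmentId);
    }

    public static void main(String[] args) {
        // "Trîcky Nåme.txt" keeps most of its shape.
        System.out.println(escapeNonAscii("Tr\u00EEcky N\u00E5me.txt")); // Tr_cky N_me.txt
        // A two-character CJK name with an ASCII extension loses everything.
        System.out.println(escapeNonAscii("\u5929\u624D.txt"));          // __.txt
        // The ID-based name is the same regardless of what the user uploaded.
        System.out.println(idBasedName(10042L));                         // 10042
    }
}
```

With the second option the original filename would only live in the database, which is exactly the trade-off described above.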
Note that there are also possibly some Unicode normalization issues that we are currently open to, but I still haven't investigated them enough to say for certain what their impact on JIRA is.
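For reference, this is the kind of thing "normalization issues" means in general: the same visible name can be spelled with different code points, and therefore different bytes. The sketch below only illustrates that with java.text.Normalizer; the class name NormalizationDemo and the helper sameAfterNfc are hypothetical, and it says nothing about whether JIRA is actually affected.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class NormalizationDemo {

    // Compare two names after bringing both into the same normalization form (NFC).
    public static boolean sameAfterNfc(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String composed = "\u00E4.txt";    // ä as one code point (U+00E4)
        String decomposed = "a\u0308.txt"; // a followed by a combining diaeresis (U+0308)

        // Both render as ä.txt, but they are different strings with different bytes.
        System.out.println(composed.equals(decomposed));                        // false
        System.out.println(composed.getBytes(StandardCharsets.UTF_8).length);   // 6
        System.out.println(decomposed.getBytes(StandardCharsets.UTF_8).length); // 7
        // After normalizing both to NFC they compare equal.
        System.out.println(sameAfterNfc(composed, decomposed));                 // true
    }
}
```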