Type: Bug
Resolution: Fixed
Priority: Low
Fix Version/s: 4.3, Bugfix Release
Affects Version/s: 4.2
Component/s: None
Labels:
- affects-server
- interesting

Introduced in Version:
4.02
Bug Fix Policy:
View Atlassian Server bug fix policy

Once upon a time we tried to make filename encoding for attachments more robust. ~~JRA-17805~~ was where that was tracked. (Pete did some stealth work on it; despite not appearing anywhere on the issue, he did the work. Look at the source tab.) After looking at a support case today and investigating further the insanity that is the Unicode spec, I think the current implementation is insufficiently robust.

Right now we do something like:

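File[] files = attachmentDir.listFiles();  // attachmentDir: placeholder java.io.File for the attachment directory
InputStream in = new FileInputStream(files[0]);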

It might seem surprising but this can still throw a FileNotFoundException. What can happen is this:

Use Linux.
Set the user's encoding to ISO-8859-1.
Create the file ä.txt. This will be saved on disk as a sequence of bytes that includes the byte 228 (0xE4), since that's what a-umlaut is in ISO-8859-1.
Switch the encoding to UTF-8.
Run the above snippet.
listFiles will see that 228 and try to interpret it as UTF-8. It's not valid UTF-8: bytes above 127 can only appear as part of a multi-byte sequence, and 0xE4 is a lead byte that must be followed by two continuation bytes, not by ".txt". So it gets replaced with U+FFFD, the Unicode replacement character (EF BF BD in UTF-8). Your list will contain the file <EF><BF><BD>.txt, which usually just gets displayed as ?.txt
When you try to open the file <EF><BF><BD>.txt, the filesystem will naturally complain that no such file exists, because nothing has that sequence of bytes as its name.
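The byte-level effect of the steps above can be reproduced without touching the filesystem. This is a minimal sketch, not what the JVM literally runs inside listFiles: it assumes the filename bytes are decoded the way `new String(bytes, UTF_8)` decodes them (malformed input replaced with U+FFFD), and the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ReplacementCharDemo {
    // Render bytes as space-separated upper-case hex, e.g. "E4 2E".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Decode raw filename bytes as UTF-8 (malformed bytes become U+FFFD),
    // then encode the resulting String back to UTF-8. This models
    // "list the file, then ask the filesystem for that name".
    static byte[] roundTripAsUtf8(byte[] rawName) {
        String decoded = new String(rawName, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The name as written to disk under ISO-8859-1 ("\u00e4" is a-umlaut).
        byte[] onDisk = "\u00e4.txt".getBytes(StandardCharsets.ISO_8859_1);
        byte[] askedFor = roundTripAsUtf8(onDisk);

        System.out.println("on disk:   " + hex(onDisk));
        System.out.println("asked for: " + hex(askedFor));
        System.out.println("same name? " + Arrays.equals(onDisk, askedFor));
    }
}
```

The two byte sequences differ, which is why opening the listed file fails: the name handed back to the filesystem is no longer the name stored on disk.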

I think the only way to be safe is to remove all "unsafe" characters from the attachment name. There are two possible options here:

Leave ASCII characters as-is and replace the non-ASCII stuff with some kind of escaping mechanism. Instead of "Trîcky Nåme.txt" it might be stored on disk as "Tr_cky N_me.txt" or similar. That seems like a decent compromise for most European languages: they still have some semblance of the original name available. But it screws over the rest of the world. 天才.txt would just become __.txt, which is pretty useless. So I recommend we just do....
Drop all pretense of keeping the original filename around. Just use the plain ID. Digits 0-9. Never have to worry about encoding issues again. The downside: people who want to access the filesystem and browse attachments that way now need to go to the database to map files back to their names. I don't think that is nearly a common enough use case to worry about.
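To make the trade-off between the two options concrete, here is a rough sketch of both naming schemes. The class and method names are hypothetical, the underscore is just one possible escape, and the sketch does not deal with ASCII characters that are unsafe for other reasons (path separators, for instance).

```java
public class AttachmentFileNames {
    // Option 1: keep ASCII as-is and replace every non-ASCII code point
    // with '_'. Working on code points means a supplementary character
    // becomes one underscore rather than two.
    static String escapeNonAscii(String originalName) {
        StringBuilder sb = new StringBuilder();
        originalName.codePoints()
                .forEach(cp -> sb.appendCodePoint(cp < 128 ? cp : '_'));
        return sb.toString();
    }

    // Option 2: ignore the original name entirely and use only the
    // attachment's numeric ID, so the on-disk name is digits 0-9.
    static String idOnly(long attachmentId) {
        return Long.toString(attachmentId);
    }

    public static void main(String[] args) {
        // "Tr\u00eecky N\u00e5me.txt" is "Trîcky Nåme.txt" written with escapes.
        System.out.println(escapeNonAscii("Tr\u00eecky N\u00e5me.txt"));
        // "\u5929\u624d.txt" is "天才.txt" written with escapes.
        System.out.println(escapeNonAscii("\u5929\u624d.txt"));
        System.out.println(idOnly(10042L)); // 10042 is an arbitrary example ID
    }
}
```

With option 2 the original filename would have to be kept somewhere else (the issue suggests the database) so it can still be shown to users.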

Note that there may also be some Unicode normalization issues that we are currently open to, but I haven't investigated them enough yet to say for certain what their impact on JIRA is.
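For readers unfamiliar with the normalization problem in general: the same visible name can be spelled with different code point sequences, and those spellings are different Strings (and different byte sequences on disk). The sketch below only illustrates that general fact using java.text.Normalizer; it makes no claim about whether or how JIRA is actually affected, and the class name is hypothetical.

```java
import java.text.Normalizer;

public class NormalizationDemo {
    // Compare two names after bringing both to NFC (composed) form.
    static boolean sameAfterNfc(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String composed = "\u00e4.txt";    // a-umlaut as one code point
        String decomposed = "a\u0308.txt"; // 'a' followed by combining diaeresis

        // Plain equality sees two different names.
        System.out.println(composed.equals(decomposed));
        // After normalization they compare equal.
        System.out.println(sameAfterNfc(composed, decomposed));
    }
}
```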

is related to

JRASERVER-23830 Attachments lost when performing a bulk move, single issue move, or project import

Closed

JRASERVER-16009 Strange encoding issue when file name of attachment contains funny characters

Closed

JRASERVER-19873 attachment directory structure has exceeded maximum file handles

Closed

Assignee: Peter Leschev

Reporter: Justus Pendleton (Inactive)

Votes: 1

Watchers: 5

Created: 20/Dec/2010 4:14 AM

Updated: 28/Mar/2019 12:01 AM

Resolved: 15/Mar/2011 10:16 PM
