Once upon a time we tried to make filename encoding for attachments more robust. JRA-17805 was where that was tracked. (Pete did some stealth work on it; despite not appearing anywhere on the issue, he did the work. Look at the source tab.) After looking at a support case today and investigating further the insanity that is the Unicode spec, I think the current implementation is insufficiently robust.

      Right now we do something like:

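      File[] files = attachmentDir.listFiles();   // attachmentDir: placeholder java.io.File for the attachment directory
      new FileInputStream(files[0]);              // can still throw FileNotFoundException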
      

      It might seem surprising but this can still throw a FileNotFoundException. What can happen is this:

      1. Use Linux.
      2. Set the user's encoding to ISO-8859-1.
      3. Create a file named ä.txt. It will be saved on disk as a sequence of bytes that includes the byte 228 (0xE4), since that's what a-umlaut is in ISO-8859-1.
      4. Switch the encoding to UTF-8.
      5. Run the above snippet.
      6. listFiles will see that byte 228 and try to decode it as UTF-8. On its own it's not valid UTF-8: bytes above 127 can only appear as part of a multi-byte sequence, and 0xE4 would have to be followed by two continuation bytes. So it gets replaced with U+FFFD, the Unicode replacement character, which is EF BF BD once encoded back to UTF-8. Your list will therefore contain the file <EF><BF><BD>.txt, which usually just gets displayed as ?.txt
      7. When you try to open the file <EF><BF><BD>.txt the filesystem will naturally complain that no such file exists, because there is nothing with that sequence of bytes as its name.
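
The round trip in steps 6 and 7 can be shown without touching the filesystem. This is a minimal sketch, not JIRA code: the class ReplacementDemo and its helpers are made up for illustration, and it assumes file names get decoded the same way new String(bytes, UTF_8) decodes them, with malformed input replaced by U+FFFD.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; none of these names come from JIRA.
class ReplacementDemo {
    // Format bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Decode the on-disk bytes as UTF-8, then encode the resulting String back to UTF-8.
    static byte[] roundTrip(byte[] onDisk) {
        String decoded = new String(onDisk, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The ISO-8859-1 bytes of the a-umlaut file name from step 3.
        byte[] onDisk = {(byte) 0xE4, '.', 't', 'x', 't'};
        System.out.println("on disk:    " + hex(onDisk));            // E4 2E 74 78 74
        System.out.println("round trip: " + hex(roundTrip(onDisk))); // EF BF BD 2E 74 78 74
    }
}
```

The name that comes back no longer matches any byte sequence on disk, which is why opening it fails.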

      I think the only way to be safe is to remove all "unsafe" characters from the attachment name. There are two possible options here:

      1. Leave ASCII characters as-is and replace the non-ASCII stuff with some kind of escaping mechanism. Instead of "Trîcky Nåme.txt" it might be stored on disk as "Tr_cky N_me.txt" or similar. That seems like a decent compromise for most European languages – they still have some semblance of the original name available. But it screws over the rest of the world. 天才.txt would just become __.txt, which is pretty useless. So I recommend we just do....
      2. Drop all pretense of keeping the original filename around. Just use the plain ID. Digits 0-9. Never have to worry about encoding issues again. The downside: people who want to access the filesystem and browse attachments that way now need to go to the database to work out which file is which. I don't think that is nearly a common enough use case to worry about.
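
To compare the two options side by side, here is a rough sketch. The class AttachmentNames and both method names are hypothetical, not JIRA's API, and option 1 is shown with the simplest possible escaping (one underscore per non-ASCII code point); a real implementation would also have to deal with path separators and control characters.

```java
// Hypothetical helper; the names are illustrative only.
class AttachmentNames {
    // Option 1: keep ASCII code points, replace every other code point with '_'.
    static String escapeNonAscii(String name) {
        StringBuilder sb = new StringBuilder();
        name.codePoints().forEach(cp -> sb.append(cp < 128 ? (char) cp : '_'));
        return sb.toString();
    }

    // Option 2: ignore the original name entirely and use the numeric ID.
    static String idOnly(long attachmentId) {
        return Long.toString(attachmentId);
    }

    public static void main(String[] args) {
        // Non-ASCII characters are written as escapes so the source file stays ASCII.
        System.out.println(escapeNonAscii("Tr\u00EEcky N\u00E5me.txt")); // Tr_cky N_me.txt
        System.out.println(escapeNonAscii("\u5929\u624D.txt"));          // __.txt
        System.out.println(idOnly(10000L));                              // 10000
    }
}
```

Option 2 never looks at the user-supplied name, so no encoding of that name can affect what ends up on disk.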

      Note that we may also be open to some Unicode normalization issues, but I haven't investigated them enough yet to say for certain what their impact on JIRA is.
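
For context, this is what a normalization mismatch looks like in general; whether JIRA actually hits it is, as noted, not established. The sketch uses java.text.Normalizer from the JDK; the class NormalizationDemo and its helpers are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical demo class, not JIRA code.
class NormalizationDemo {
    static String nfc(String s) { return Normalizer.normalize(s, Normalizer.Form.NFC); }
    static String nfd(String s) { return Normalizer.normalize(s, Normalizer.Form.NFD); }

    public static void main(String[] args) {
        String composed = nfc("\u00E4.txt");   // a-umlaut as one code point
        String decomposed = nfd("\u00E4.txt"); // 'a' followed by a combining diaeresis

        // Both render identically but are different strings and different bytes.
        System.out.println(composed.equals(decomposed));                          // false
        System.out.println(composed.getBytes(StandardCharsets.UTF_8).length);     // 6
        System.out.println(decomposed.getBytes(StandardCharsets.UTF_8).length);   // 7

        // Normalizing both sides to the same form makes them comparable again.
        System.out.println(nfc(decomposed).equals(composed));                     // true
    }
}
```

If one layer stores the composed form and another hands back the decomposed one, a name-based lookup can miss even though both names look the same on screen.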

            [JRASERVER-23311] Attachment encoding protection still insufficient

            Oliver Siegmar added a comment -

            I forgot that the filename column in the fileattachment table is still required to display a filename in JIRA and to send it to the client when downloading. But again, I'd appreciate getting rid of the filenames on disk.

            Oliver Siegmar added a comment -

            I'd very much appreciate it if this issue were resolved in 4.3. We often have problems with badly encoded filenames within JIRA. Why not just drop the filename column in the fileattachment table and rename all attachments to the ID?

            Justus Pendleton (Inactive) added a comment -

            This change has one unfortunate consequence:

            Current JIRA assumes that "filename on disk" == "filename to display in URL". Obviously there's no reason for that to be true, but that's what things like the DefaultThumbnailManager assume. This means that certain URLs (for thumbnails and in the issue navigator) are now just the ID, i.e.

            old: /thumbnail/10000/10000_cool_picture.jpg
            new: /thumbnail/10000/10000

            Back in the ancient days of the web there were browsers that didn't like those kinds of URLs for media. Things are much better nowadays. (Despite the bland URL, browsers will prompt you to save the file with the "correct" name, because we send along all the proper MIME headers.) So while the URL isn't as nice as we'd like, we'll live with it. Fixing this would involve changes to the Thumber, which lives in atlassian-core, and would need to be coordinated with Confluence (and any other products that use the Thumber in atlassian-core).

            I think we should also consider biting the bullet and fixing up all existing attachments in an upgrade task. This would mean we could move the current "handle legacy attachment naming" logic from AttachmentUtils into an upgrade task. We wouldn't be able to fix broken attachments, but at least we could generate a list of them and point the customer to a KB article explaining how to fix them outside of JIRA. If we decide to do this, we should also segment the directory hierarchy at the same time (to handle the 32,000+ subdirectory problem). That way there is only a single upgrade task.

            We've been afraid that doing this would be a big headache, but Confluence did it for their 3.0 release. I talked to Confluence's support team and they said they haven't had any real support problems from it. That makes me think we are overestimating how many people ever look at the attachment directory in JIRA.
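
On the point that browsers still save the file under the "correct" name even though the URL is just the ID: the comment doesn't say which headers are meant. Assuming it is Content-Disposition in the RFC 6266 / RFC 5987 form, a sketch of building the header value from the display name stored in the database could look like this. The class DownloadHeaders and its method are hypothetical, not JIRA code, and the two-argument URLEncoder.encode(String, Charset) needs Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not taken from JIRA or atlassian-core.
class DownloadHeaders {
    // Build a Content-Disposition value carrying the display name, percent-encoded as UTF-8.
    static String contentDisposition(String displayName) {
        String encoded = URLEncoder.encode(displayName, StandardCharsets.UTF_8)
                .replace("+", "%20")   // URLEncoder writes spaces as '+'; the header needs %20
                .replace("*", "%2A");  // '*' is left as-is by URLEncoder but is not allowed unencoded here
        return "attachment; filename*=UTF-8''" + encoded;
    }

    public static void main(String[] args) {
        System.out.println(contentDisposition("cool picture.jpg"));
        System.out.println(contentDisposition("\u00E4.txt"));
    }
}
```

With a header like this the name the user sees comes from the database, so the URL and the on-disk name can both be the bare ID.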

            Justus Pendleton (Inactive) added a comment -

            Since I've been unable to reproduce any of the normalization stuff, I'm still uncertain how it would actually manifest in the real world. However, the current implementation – which no longer uses the original file name in any way, shape, or form – should be immune to any such things.

            We are still left with the problem of what to do with existing attachments. Some kind of upgrade task would be best in an ideal world, but the risks (what if the customer has hundreds of thousands of attachments? does it break their incremental backups? etc.) outweigh the gains. So we'll continue to live with the multiple legacy ways attachments can appear on disk.

            Issa added a comment -

            Thank you for looking this up.

            Maybe JIRA should include a manual task that fixes the file names in case of problems?

            Here are the Perl scripts I have written for our case.

            #!/usr/bin/perl
            # Script 1: print the expected on-disk path of every attachment in the
            # database whose file does not exist under that (NFC-normalized) name.
            use utf8;
            use strict;
            use DBI();
            use Unicode::Normalize;
            
            # Connection details and the attachment root below are site-specific.
            my $db = DBI->connect("DBI:mysql:database=jira;host=localhost;port=12034;mysql_socket=/tmp/mysql-jira-prd.sock",
                    "jira", "password", {'RaiseError' => 1});
            
            # One row per attachment: project key, issue key, attachment id, stored filename.
            my $stmt = $db->prepare("select p.pkey, i.pkey, f.id, f.FILENAME
                     from jiraissue i, fileattachment f, project p
                     where i.ID = f.issueid and i.project = p.id");
            
            $stmt->execute();
            while (my @row = $stmt->fetchrow_array()) {
            #   print "Proj = $row[0], id = $row[1], attch id = $row[2], filename = $row[3]\n";
                    # Layout on disk: <root>/<project key>/<issue key>/<attachment id>_<filename>
                    my $filename = "/ec/prod/server/citnet/jira/data/jira/data/attachments/"
                             . $row[0] . "/" . $row[1] . "/" . $row[2] . "_" . NFC($row[3]);
                    if (!-e $filename) {
                            print "$filename\n";
                    }
            }
            $stmt->finish();
            
            $db->disconnect();
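
            #!/usr/bin/perl
            # Script 2: read the paths printed by script 1 on STDIN and, where exactly one
            # file with the same attachment id exists, rename it to the name from the DB.
            use utf8;
            use strict;
            
            while (<STDIN>) {
                    chomp;
                    if (!-e $_) {
                            my $folder = substr($_, 0, rindex($_, "/"));
                            my $filename = substr($_, rindex($_, "/") + 1);
                            # The attachment id is everything before the first underscore.
                            my $fileprefix = substr($filename, 0, index($filename, "_"));
            
                            chdir $folder;
                            # Match "<id>_..." only, so that id 100 does not also pick up 1000_...
                            my @fsfiles = glob "${fileprefix}_*";
                            if (scalar(@fsfiles) < 1) {
                                    print "FILE NOT FOUND:$_\n";
                            } elsif (scalar(@fsfiles) > 1) {
                                    print "MULTIPLE FILES WITH SAME ID:$_\n";
                            } else {
                                    print "RENAMING:$_";
                                    print " DB filename [$filename] FS filename [$fsfiles[0]]\n";
                                    rename $fsfiles[0], $filename;
                            }
                    }
            
                    # Report files that are not owned by the user running the script.
                    if (!-o $_) {
                            print "NOT OWNED BY STUDIO:$_\n";
                    }
            }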

            The output of the first script can be piped into the second. For some reason I don't understand, I had to split the work into these two parts. The scripts only handle a UTF-8 locale.

            Be sure to launch them with the perl switches -CSDL


            Justus Pendleton (Inactive) added a comment -

            Please ensure that we provide a note in the release notes/upgrade guide about why this is happening. We will also need to let support know.

            Justus Pendleton (Inactive) added a comment -

            I am unable to reproduce (on Mac OS X) the behaviour described in that Stack Overflow Unicode normalization post. I'm not sure what else is required to trigger the behaviour mentioned there.

            Justus Pendleton (Inactive) added a comment -

            See http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4866151 for a similar explanation of this same problem.

              pleschev Peter Leschev
              jpendleton Justus Pendleton (Inactive)