  2. CONFSERVER-2461

wrong URL encoding of non-ASCII chars when redirected to space homepage

    • Type: Bug
    • Resolution: Fixed
    • Priority: Medium
    • 1.4
    • 1.3
    • None
    • Solaris 9, Tomcat 5.0.x, UTF-8 encoding set in Confluence and also using -Dfile.encoding

      Reproduction:
      1. Create a new space.
      2. Rename the space's home page. Use non-ASCII characters in the new name, for instance č (c with caron). Let's say we name it "Home of čžš" (without quote marks).
      3. Go to the dashboard.

      In the "Your Spaces" portlet, there are two ways of accessing the space's home page. One is the link in the "Space (Key)" column, the other is the home icon in the "Operations" column. The second method works, as the URL has been changed to "webapp style", e.g. http://example.com/confluence/pages/viewpage.action?pageId=99. The first method sends you to the space URL (e.g. http://example.com/confluence/display/PES, where PES is the space key), which then redirects you (via HTTP 302) to the "wiki style" URL, in this case http://example.com/confluence/display/PES/Home%2Bof%2B%25C4%258D%25C5%25BE%25C5%25A1.

      The problem with this URL is that it does not conform to RFC 2396, section 2.4.1. In practice, this badly breaks Apache's mod_proxy.

      I suspect that in the first method the URL went through URLEncoder twice. Java's URLEncoder encodes "Home of čžš" as "Home+of+%C4%8D%C5%BE%C5%A1". Encoding that result a second time turns every '%' into "%25" and every '+' into "%2B", which would explain why the letter č is encoded as %25C4%258D instead of %C4%8D. However, manually stripping the '25' does not yield a URL that Confluence recognizes...
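The suspected double pass can be reproduced with a minimal sketch like the one below. It assumes Java 10 or later (for the `URLEncoder.encode(String, Charset)` overload) and UTF-8 as the encoding; the class and method names are hypothetical and are not taken from Confluence's code.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Confluence.
public class DoubleEncodeDemo {

    // One pass through URLEncoder, using UTF-8 for non-ASCII characters.
    static String encodeOnce(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    // Two passes: the '%' and '+' produced by the first pass are
    // themselves escaped as "%25" and "%2B" by the second pass.
    static String encodeTwice(String s) {
        return encodeOnce(encodeOnce(s));
    }

    public static void main(String[] args) {
        // "Home of čžš", written with Unicode escapes so the source file
        // encoding does not matter.
        String title = "Home of \u010D\u017E\u0161";

        System.out.println(encodeOnce(title));
        // prints: Home+of+%C4%8D%C5%BE%C5%A1

        System.out.println(encodeTwice(title));
        // prints: Home%2Bof%2B%25C4%258D%25C5%25BE%25C5%25A1
    }
}
```

The second line of output matches the path segment of the redirect URL in the report, which is consistent with the double-encoding suspicion.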


            Charles Miller (Inactive) added a comment:

            We need to hunt down and kill any use of doubleUrlEncode for page URLs, and replace it with page.getUrlPath()

            Charles Miller (Inactive) added a comment:

            The "wiki-style" URLs are friendlier, so we try to use them whenever possible.

            Double-encoding was a sneaky way to get around the fact that every application server used a different character encoding for high-bit characters. It's perfectly legal, but it's ugly and caused problems of its own, which is why we switched to the webapp-style URLs for "dangerous" page titles instead. Unfortunately, it seems that we didn't catch every place where this was being used.

            Should be simple to fix.

            Ronny Pettersen added a comment:

            The same happens if I try to add an attachment to a page whose title contains non-ASCII characters.
            After browsing for the file and pressing "Attach", I get "Page not found" from Confluence.
            The attachment was added, but the redirect failed.
            This is the URL it opens:
            http://localhost/confluence/display/INFO/Testmilj%25C3%25B8?showAttachments=true#attachments
            This would be the correct URL:
            http://localhost/confluence/display/INFO/Testmilj%F8?showAttachments=true#attachments
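The two URLs in the comment above differ in two ways: the number of encoding passes and the character set used for "ø". The following sketch illustrates both, assuming Java 10 or later; the class and method names are hypothetical, and the choice of ISO-8859-1 is only an inference from the "%F8" in the URL the commenter expected.

```java
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Confluence.
public class CharsetEncodeDemo {

    // Percent-encode a page title using the given character set.
    static String encode(String s, Charset cs) {
        return URLEncoder.encode(s, cs);
    }

    public static void main(String[] args) {
        // "Testmiljø", written with a Unicode escape for portability.
        String title = "Testmilj\u00F8";

        // ISO-8859-1 represents "ø" as the single byte 0xF8.
        System.out.println(encode(title, StandardCharsets.ISO_8859_1));
        // prints: Testmilj%F8

        // UTF-8 represents "ø" as the two bytes 0xC3 0xB8.
        String utf8 = encode(title, StandardCharsets.UTF_8);
        System.out.println(utf8);
        // prints: Testmilj%C3%B8

        // A second pass over the UTF-8 result escapes each '%' as "%25".
        System.out.println(encode(utf8, StandardCharsets.UTF_8));
        // prints: Testmilj%25C3%25B8
    }
}
```

Under these assumptions, the failing URL corresponds to a UTF-8 title encoded twice, while the URL described as correct corresponds to a single ISO-8859-1 pass.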

            Deleted Account (Inactive) added a comment:

            Err, not uuencoded, but URL-encoded.

            Deleted Account (Inactive) added a comment:

            The wrong URL is the third one, that is http://example.com/confluence/display/PES/Home%2Bof%2B%25C4%258D%25C5%25BE%25C5%25A1. As you can see, there are 4 characters after %, for instance %25C4. If this were a correct URL, then its unencoded form would be http://example.com/confluence/display/PES/Home+of+%C4%8D%C5%BE%C5%A1. Of course, typing this into a browser (and replacing example.com with the real host) results in a 404.

            Note that %25 is the uuencoded form of '%'.
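The decoding step described in the comment above can be checked with a short sketch. It assumes Java 10 or later (for `URLDecoder.decode(String, Charset)`) and UTF-8; the class and method names are hypothetical.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Confluence.
public class DecodeDemo {

    // One URL-decoding pass, interpreting escaped octets as UTF-8.
    static String decodeOnce(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Path segment taken from the redirect URL in the report.
        String segment = "Home%2Bof%2B%25C4%258D%25C5%25BE%25C5%25A1";

        // The first pass only undoes the outer layer: "%25" becomes '%'
        // and "%2B" becomes '+', leaving a still-encoded string.
        String once = decodeOnce(segment);
        System.out.println(once);
        // prints: Home+of+%C4%8D%C5%BE%C5%A1

        // Only a second pass recovers the original page title.
        String twice = decodeOnce(once);
        System.out.println(twice.equals("Home of \u010D\u017E\u0161"));
        // prints: true
    }
}
```

A component that decodes the path only once, as a proxy or browser would, therefore sees a title that still contains percent escapes.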

            Scott Farquhar added a comment:

            I'm not sure why you say that this violates the RFC. As far as I can tell, this is a valid URL. To quote section 2.4.1:

            "An escaped octet is encoded as a character triplet, consisting of the
            percent character "%" followed by the two hexadecimal digits
            representing the octet code. For example, "%20" is the escaped
            encoding for the US-ASCII space character."

            In the URL above, there are two characters after every '%'?

            Deleted Account (Inactive) added a comment:

            There is little use for "wiki style" URLs for pages with non-ASCII characters in the title. Therefore you should change the redirect URL to "webapp style".

            BTW: it seems that both methods are functionally the same, so why not get rid of one?

              Assignee: Unassigned
              Reporter: Deleted Account (Inactive)