[CONFSERVER-11192] Underscores confuse CamelCase link interpretation

Type: Bug
Resolution: Won't Fix
Priority: Medium
Fix Version/s: None
Affects Version/s: 2.7.1, 3.4.7, 3.5.1
Component/s: None
Labels:
- affects-server
- editor

Underscores are being treated as whitespace by the CamelCase algorithm.

AccountInfo_PruX is interpreted as [AccountInfo]_PruX rather than [AccountInfo_PruX] (preferred), [AccountInfo]_[PruX] or AccountInfo_PruX

Matt Ryall added a comment - 27/May/2011 1:18 AM

So we need to move the CamelCase renderer component after the phrase renderers and see whether that breaks anything.

In fact, with further testing I see that it does. Anything that inserts a token in the output (such as the HTML entity or backslash handling) will end up breaking CamelCase links by changing their text – and therefore their destination! – before the CamelCase link renderer runs.

Here's what the CamelCase renderer sees when a token appears within a CamelCase word, which it interprets as a valid CamelCase link:

CamelCaseinltokxyzkdtnhgnsbdfinltok#inltokxyzkdtnhgnsbdfinltokLink
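
A minimal sketch of why the tokenised text still looks like a link: it runs the `LINK_CAMELCASE_PATTERN` quoted in the technical notes on this issue against the string above. The class name and the `firstLink` helper are made up for illustration and are not Confluence code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenisedCamelCaseDemo {
    // Pattern copied from the technical notes on this issue.
    static final Pattern LINK_CAMELCASE_PATTERN = Pattern.compile(
        "(^|[^\\p{Alpha}!\\^])([\\p{Lu}][\\p{Alnum}]*[\\p{L}&&[^\\p{Lu}]][\\p{Alnum}]*[\\p{Lu}][\\p{Alnum}]+)",
        Pattern.DOTALL);

    // Hypothetical helper: returns the first CamelCase candidate (group 2), or null if none.
    static String firstLink(String text) {
        Matcher m = LINK_CAMELCASE_PATTERN.matcher(text);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String tokenised = "CamelCaseinltokxyzkdtnhgnsbdfinltok#inltokxyzkdtnhgnsbdfinltokLink";
        // The token consists of plain ASCII letters, so it is absorbed into the link text.
        System.out.println(firstLink(tokenised));
    }
}
```

In this sketch the match runs from `CamelCase` through the first token and stops at `#`, so the candidate link text is no longer the text the author typed.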

It also seems to break the behaviour of embedded images inside text (e.g. Some!...!text) for some reason I don't fully understand.

So this is going to be closed as 'Won't Fix':

- changing the behaviour of the CamelCase renderer component without changing the rendering pipeline would break the behaviour of italicised camel case links
- changing the behaviour of the CamelCase renderer component and reordering the pipeline so italic links work would break links for CamelCase-like text containing various special characters and the rendering of embedded images in some cases
- the next major release of Confluence solely uses a rich text editor, so CamelCase linking is no longer required or supported.

Basically, the difficulty and risk of fixing it greatly outweigh the benefit of solving this particular problem with CamelCase links. Sorry for taking so long to reach this decision.

Matt Ryall added a comment - 27/May/2011 12:35 AM

Some technical notes on this case.

I think our CamelCase renderer tries to mimic the behaviour of the old C2 wiki, which was the source of the original camel case. According to the LinkPattern page, their matching expression is this:

\b([A-Z][a-z]+){2,}\b

As you can see, this is a very limited regular expression: it only works for Latin letters in the ASCII range and, unlike Confluence's pattern, doesn't support numbers or other languages.

In Perl, this regex works the way described in the comments above with regard to underscores, completely skipping words that contain underscores:

$ perl -ne 'print /\b([A-Z][a-z]+){2,}\b/ ? "$&\n" : "(no match)\n"'
test
(no match)
testing CamelCase links
CamelCase
testing CamelCase_LinksWith underscores
(no match)
testing _CamelCase_ links
(no match)

In Confluence, the CamelCaseLinkRendererComponent has the following pattern, which supports Unicode so it's a little bit harder to read:

    // (^|[^\p{Alpha}!\^]) -- non-alpha or line-beginning before the pattern-match
    //                     -- also don't match '!' - CONF-3923
    //                     -- also don't match '^' - CONF-3447
    // ([\p{Lu}]          -- match starting with an upper-case Unicode character
    // [\p{Alnum}]*       -- potentially any number of alphanumeric characters of any case
    // [\p{L}&&[^\p{Lu}]] -- but there must be _at least one_ lower-case character
    // [\p{Alnum}]*       -- potentially followed by more alphanumerics
    // [\p{Lu}]           -- followed by an upper-case character
    // [\p{Alnum}]+)      -- and more alphanumerics to the end.
    static final Pattern LINK_CAMELCASE_PATTERN = Pattern.compile("(^|[^\\p{Alpha}!\\^])([\\p{Lu}][\\p{Alnum}]*[\\p{L}&&[^\\p{Lu}]][\\p{Alnum}]*[\\p{Lu}][\\p{Alnum}]+)", Pattern.DOTALL);

This has some special handling at the start around certain characters, but doesn't have any 'terminating character' matching or require a \b "word break" at the start or end. Adding \b at the end helps with breaking when we get to an underscore but we should test its behaviour with non-ASCII letter characters that should be part of the camel case link (so, for example, GrandesÉcoles should continue to link properly).
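
As a rough way to try this out, the sketch below compares the pattern as quoted above with an assumed variant that simply appends `\b`. The class name, the `links` helper and the sample strings are illustrative only; the real renderer does more than run this one expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordBreakDemo {
    // Regex text copied from LINK_CAMELCASE_PATTERN above.
    static final String CAMELCASE_REGEX =
        "(^|[^\\p{Alpha}!\\^])([\\p{Lu}][\\p{Alnum}]*[\\p{L}&&[^\\p{Lu}]][\\p{Alnum}]*[\\p{Lu}][\\p{Alnum}]+)";

    static final Pattern CURRENT = Pattern.compile(CAMELCASE_REGEX, Pattern.DOTALL);
    // Assumed variant: the same expression with a trailing word break appended.
    static final Pattern WITH_WORD_BREAK = Pattern.compile(CAMELCASE_REGEX + "\\b", Pattern.DOTALL);

    // Hypothetical helper: collects every CamelCase candidate (group 2) found in the text.
    static List<String> links(Pattern pattern, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            found.add(m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        String[] samples = {"AccountInfo_PruX", "testing CamelCase links", "Grandes\u00C9coles"};
        for (String s : samples) {
            System.out.println(s + " -> current " + links(CURRENT, s)
                + ", with word break " + links(WITH_WORD_BREAK, s));
        }
    }
}
```

Under these assumptions the current pattern links only `AccountInfo` in `AccountInfo_PruX`, while the `\b` variant skips that word entirely, because Java's default `\b` sees no boundary between a letter and an underscore. Both variants still link `CamelCase` and `GrandesÉcoles` in the other two samples, though that single non-ASCII case is not a substitute for the broader testing suggested above.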

Adding a negative lookahead just for underscore at the end also causes our pattern to backtrack. So it ends up converting CamelCase_WithUnderscore into [CamelCas]e_WithUnderscore, which is clearly wrong. So we need to have a negative lookahead that includes letters, underscore and the end of input/line.
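
The backtracking can be reproduced outside Confluence. The sketch below is only an approximation: `UNDERSCORE_ONLY` appends `(?!_)` to the quoted expression, and `STRICT` appends `(?![\p{Alnum}_])`, which is one possible reading of the stricter lookahead described above. Both suffixes, the class name and the helper are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadDemo {
    // Regex text copied from LINK_CAMELCASE_PATTERN above.
    static final String CAMELCASE_REGEX =
        "(^|[^\\p{Alpha}!\\^])([\\p{Lu}][\\p{Alnum}]*[\\p{L}&&[^\\p{Lu}]][\\p{Alnum}]*[\\p{Lu}][\\p{Alnum}]+)";

    // Assumed variant 1: reject a match only when an underscore follows it.
    static final Pattern UNDERSCORE_ONLY = Pattern.compile(CAMELCASE_REGEX + "(?!_)", Pattern.DOTALL);
    // Assumed variant 2: reject a match when an alphanumeric or an underscore follows it.
    static final Pattern STRICT = Pattern.compile(CAMELCASE_REGEX + "(?![\\p{Alnum}_])", Pattern.DOTALL);

    // Hypothetical helper: collects every CamelCase candidate (group 2) found in the text.
    static List<String> links(Pattern pattern, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            found.add(m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "CamelCase_WithUnderscore";
        // The trailing quantifier gives back one character so the lookahead sees 'e', not '_'.
        System.out.println("underscore-only lookahead: " + links(UNDERSCORE_ONLY, text));
        // With letters excluded as well, giving back characters no longer helps.
        System.out.println("strict lookahead: " + links(STRICT, text));
    }
}
```

With the underscore-only lookahead the first candidate is `CamelCas`, the truncated link described above; the strict variant no longer produces it. In this standalone sketch `WithUnderscore` still matches in both variants, because the leading guard `[^\p{Alpha}!\^]` accepts an underscore; the full rendering pipeline may behave differently.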

Simply adding underscore as a character that could not appear at the end of a CamelCase link also breaks CamelCase links in italic phrases, because _LinksLikeThis_ would not work as links because the CamelCase processing happens before the italic phrase processing. So we need to move the CamelCase renderer component after the phrase renderers and see whether that breaks anything.
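
To illustrate the italic case, the sketch below feeds the raw wiki text `_LinksLikeThis_` to the quoted expression and to an assumed variant that forbids a trailing underscore via `(?![\p{Alnum}_])`. It models only the situation where the CamelCase stage sees the text before the italic markers are consumed; the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ItalicPhraseDemo {
    // Regex text copied from LINK_CAMELCASE_PATTERN above.
    static final String CAMELCASE_REGEX =
        "(^|[^\\p{Alpha}!\\^])([\\p{Lu}][\\p{Alnum}]*[\\p{L}&&[^\\p{Lu}]][\\p{Alnum}]*[\\p{Lu}][\\p{Alnum}]+)";

    static final Pattern CURRENT = Pattern.compile(CAMELCASE_REGEX, Pattern.DOTALL);
    // Assumed variant: a link may not be followed by an alphanumeric or an underscore.
    static final Pattern NO_TRAILING_UNDERSCORE =
        Pattern.compile(CAMELCASE_REGEX + "(?![\\p{Alnum}_])", Pattern.DOTALL);

    // Hypothetical helper: collects every CamelCase candidate (group 2) found in the text.
    static List<String> links(Pattern pattern, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            found.add(m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        // Raw wiki markup as the CamelCase stage would see it before italic processing.
        String italic = "_LinksLikeThis_";
        System.out.println("current: " + links(CURRENT, italic));
        System.out.println("no trailing underscore: " + links(NO_TRAILING_UNDERSCORE, italic));
    }
}
```

Here the current expression finds `LinksLikeThis`, while the variant finds nothing, since the closing italic marker is itself an underscore.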

Finally, CamelCase rendering is not the most well-tested feature in Confluence, so any change to it risks breaking stuff that is completely untested. I'm hesitant to make such a significant change to it on a bug fix release.

Rick Hadsall added a comment - 25/Mar/2008 3:02 PM

If you consider an underscore to be invalid for CamelCase, then Confluence should not take ANY part of a word that includes an underscore in it as CamelCase.

But that is not happening - it's going halfway through the word and stopping. That is a bug. Either it should ignore the full word, or, as I believe, it should include it all. Nobody is best served by half-word parsing.

ThisIsAWord
ThisIs_AWordToo <--- Confluence either should ignore the word entirely due to the underscore, or the entire word should be a link (which is what I'd prefer)
JSP <--- TWiki reads this as CamelCase even though it technically is not.

Note: Confluence does not recognize lowerCamelCase. I don't know if you intended to ignore lowerCamelCase or not, but I wanted to mention it.
That's why, in a perfect world, you would allow the administrator to:

- Turn CamelCase linking on/off
- Set whether you want to allow special characters within CamelCase links
- Set whether you want to allow lowerCamelCase
- Set whether you want all capital acronyms to be CamelCase links

But that's adding scope. The main issue I have, and why I believe it is a bug, is above - you either take the full word (a series of characters surrounded by whitespace) as a link, or don't... but right now, Confluence is taking half of a word. That's not user friendly and not correct behavior.

Rick Hadsall added a comment - 24/Mar/2008 3:02 PM

Actually, the problem is that Confluence's CamelCase parser is confused by underscores. So an attribute named AccountInfo_PruX, which should be recognized by Confluence internally as the equivalent to [AccountInfo_PruX] is actually being rendered by Confluence as [AccountInfo]_PruX.

This is a problem because a lot of attributes are named with underscores.

I think it's a bug; but if you want to make it an improvement, perhaps you could add configuration to the CamelCase parser that lets the site owner set camelcase on/off (already present), and if on, set whether underscores break camelcase or are included within, as well as whether all caps (e.g., ICD) are treated as camelcase or not.
