Confluence Data Center: CONFSERVER-5142

Stemming and wildcards do not play nicely in search queries

      There is a slight issue with searching in that if you search for a part of a word and apply a wildcard, Lucene doesn't find the word you intended.

      e.g. if you search for "Management" (no quotes) on CAC, it returns a bunch of results. A search for "Managemen*", however, only returns one.

      The reason for this is that "Managemen" is not a real English word, and so is not stemmed. The query term therefore does not match "manag", the stemmed form of "management" that we have in the index, and the correct results aren't returned. (Note: the attachment returned by the wildcard query is due to the indexing of the full filename, which then matches "managemen*".)

      A solution to this may be to store the original word (as well as the stemmed) in a different field in the index. When a wildcard search term comes through, search the full and stemmed words. The cache may be bigger, and there may be a slight performance hit, but it will make searching a bit more reliable in these edge cases.
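The two-field idea could look roughly like the sketch below. It is a toy in-memory index, not Lucene or Confluence code: `STEMS` is a hand-written table standing in for a real stemmer (the `manag` stem is the one quoted above), and `TwoFieldIndex`, `stemmed`, `exact` and `search` are hypothetical names chosen for illustration.

```python
# Toy model of the proposal: index every word twice, once stemmed and once as written.
# STEMS is a hand-written stand-in for a real stemmer; all names here are hypothetical.
from collections import defaultdict

STEMS = {"management": "manag", "managing": "manag", "manager": "manag"}


def stem(word):
    return STEMS.get(word, word)


class TwoFieldIndex:
    def __init__(self):
        self.stemmed = defaultdict(set)  # stem -> page ids
        self.exact = defaultdict(set)    # original (lowercased) word -> page ids

    def add(self, page_id, text):
        for word in text.lower().split():
            self.stemmed[stem(word)].add(page_id)
            self.exact[word].add(page_id)

    def search(self, query):
        query = query.lower()
        if not query.endswith("*"):
            # Plain terms are stemmed, exactly like the indexed text.
            return set(self.stemmed.get(stem(query), set()))
        prefix = query[:-1]
        hits = set()
        # Wildcard terms are not stemmed, so expand them against both fields.
        for field in (self.stemmed, self.exact):
            for term, pages in field.items():
                if term.startswith(prefix):
                    hits |= pages
        return hits


index = TwoFieldIndex()
index.add(1, "Project management guide")
index.add(2, "Managing releases")

print(index.search("management"))   # both pages share the stem "manag"
print(index.search("managemen*"))   # only page 1, found through the exact field
```

With only the stemmed field, `managemen*` has nothing to expand to, because the index holds `manag` rather than `management`. The price is the one already noted above: every word is stored twice.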


            Steve Lancashire (Inactive) added a comment -

            As per dloeng@atlassian.com's comments, the issues relating to stemming and the use of wildcards were fixed in 5.2 when we moved to KStem, with the limitation that a wildcard on the end of a full word will not match the root of that word. So authentication* will not match authenticate. However, authent* and authenticate* will be stemmed and match authentication, authenticator and authenticate, which addresses the original ticket and most of the comments.

            If there are other issues with search, these are probably best addressed in a separate ticket.

            Sergey Svishchev added a comment -

            "Case sensitivity in wildcard searches" was apparently fixed in 5.2 (CONF-20115)

            Mark Love added a comment -

            +1

            We've recently started with Confluence and I have users complaining about this.

            e.g. If you search for "repos" you get results for pages with "repository", whereas "reposi" returns nothing at all. From reading the above I suspect it is because "repos" is being treated as a valid word, whereas "reposi" is not. Very confusing.


            dave (Inactive) added a comment - - edited
            TL;DR

            Upgrading to Confluence 5.2 or later addresses most issues raised in this ticket.

            A large number of the issues in this ticket are caused by our choice of an overly aggressive stemming algorithm in versions of Confluence prior to 5.2.

            Examples of Porter stemming (used prior to 5.2):

            management ----> manag
            authenticator --> authent
            

            managemen* won't find anything because:

            • there are no indexed terms that managemen* can expand to (man* could expand to the indexed term manag, but managemen* cannot)
            • Lucene, by design, does not stem query terms that contain a wildcard. That is, management* won't be stemmed to manag* – see here

            Confluence version 5.2 and onwards

            The issue raised here is still possible but less likely. I encourage all watchers of this ticket to upgrade to 5.2 or later and provide us with any new feedback.

            As of 5.2, Confluence uses KStem instead of Porter.

            Examples of KStem:

            management ----> management
            authenticator --> authenticate
            

            KStem is far less aggressive and, importantly, it doesn't stem down to word roots but to valid English words (like authenticate).

            This means that blizzard's query for authenticat* will succeed, as it will be expanded to authenticate. However, authentication* still won't work. I admit the results for this query are still confusing, but I question how realistic and common that query is. It doesn't make sense to wildcard a full word.
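To make the difference concrete, here is a small sketch of prefix expansion against the indexed terms. The two lookup tables stand in for the real Porter and KStem analyzers: the `management` and `authenticator` rows are the examples quoted above, while the `authentication` rows are an assumption inferred from the discussion in this ticket. `indexed_terms` and `expand` are hypothetical helpers, not Lucene APIs.

```python
# Hand-written stand-ins for the two stemmers. The management/authenticator rows are
# the examples quoted above; the authentication rows are an assumption inferred from
# the discussion in this ticket, not output captured from a real analyzer.
PORTER = {"management": "manag", "authenticator": "authent", "authentication": "authent"}
KSTEM = {"management": "management", "authenticator": "authenticate",
         "authentication": "authenticate"}


def indexed_terms(words, stems):
    """Terms that end up in the index after stemming each word."""
    return {stems.get(word, word) for word in words}


def expand(wildcard_query, terms):
    """Expand a trailing-* query against the index; the query itself is never stemmed."""
    prefix = wildcard_query.rstrip("*")
    return sorted(term for term in terms if term.startswith(prefix))


page_words = ["management", "authenticator", "authentication"]
porter_terms = indexed_terms(page_words, PORTER)   # {"manag", "authent"}
kstem_terms = indexed_terms(page_words, KSTEM)     # {"management", "authenticate"}

for query in ("managemen*", "authenticat*", "authentication*"):
    print(query, expand(query, porter_terms), expand(query, kstem_terms))
```

With these tables, `managemen*` and `authenticat*` expand to nothing against the Porter terms but to `management` and `authenticate` against the KStem terms, while `authentication*` expands to nothing in either case because the index holds the shorter `authenticate`.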


            Jan Mueller added a comment -

            Making search better match user expectations should really be the highest priority. It's an integral part of every Confluence instance.

            Hans-Peter Geier added a comment - - edited

            Two more issues with wildcards:
            1) ? and * do not match whitespace, so, for example, "this?word" will not match "this word". Neither does "this*word".
            2) ? does not match any character with a code point above 127. The reason apparently is that UTF-8 encodes these characters internally as 2 to 4 bytes. This affects French/Spanish/... accents, German umlauts, and many more characters.
            For example, Bär is found by B??r but not by B?r as you might expect.
            A user should not have to worry about the internal representation: a ? should match one visible character, whatever the internal encoding is.
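The byte-versus-character effect described in this comment is easy to reproduce outside Confluence. The sketch below uses Python's standard `fnmatch` module purely as a stand-in wildcard matcher; it is not how Lucene matches terms, and whether Confluence really matched on bytes is the commenter's guess, not something this example confirms.

```python
from fnmatch import fnmatchcase

word = "B\u00e4r"            # "Bär", with ä as the single code point U+00E4
raw = word.encode("utf-8")   # b"B\xc3\xa4r": the ä becomes two bytes

# Matching on characters: ? covers ä, as a user would expect.
print(fnmatchcase(word, "B?r"))    # True
print(fnmatchcase(word, "B??r"))   # False

# Matching on UTF-8 bytes: ? covers one byte, so ä needs two of them.
print(fnmatchcase(raw, b"B?r"))    # False
print(fnmatchcase(raw, b"B??r"))   # True
```

The second pair reproduces the reported behaviour exactly, which is at least consistent with the wildcard being applied to encoded bytes rather than to characters.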


            Han Chen added a comment -

            I've confirmed another issue with wildcards.

            If you use a wildcard in your search, the search becomes case-sensitive, i.e.
            the results for "C?nfluence" and "c?nfluence" are not the same.

            Han


            Anatoli added a comment -

            Piotr,

            Although your comment is applicable to Confluence in this case, you are probably talking about Jira. Confluence also has the same setting for the indexing language.

            Anatoli.


            Piotr Zoladz added a comment -

            I have a similar problem.

            Go to Administration | Global settings | edit configuration
            Try changing the Jira indexing language from "english" to "other", then repeat your tests.

            Here is a description with screenshots: http://www.atlassian.com/software/jira/docs/v3.13/configure.html

            I've been looking into it for a while, and it is probably connected to the Lucene search mechanism, which applies some special handling when searching in English that is not necessarily great when you need exact results with wildcard characters.

            TonyA added a comment -

            In my testing, the word "commit" is a great example:

            1. "commit" (no wildcard) matches the literal word "commit" as well as the variations.
            2. The search term "commits" (no wildcard) returns all variations on "commit" and the same number of results.
            3. "commit*" matches the literal word "commit" as well as the variations.
            4. The search term "commits*" returns a single match for the literal "commits".

            So: stemming works, reverse stemming works, stemming with a wildcard works, reverse stemming with a wildcard doesn't.


              Assignee: Unassigned
              Reporter: Jeremy Higgs
              Affected customers: 31
              Watchers: 30