JRASERVER-5567: Incorrect stemming causes some words to be unsearchable

      For instance, try to search for an issue containing the word 'customer'. You'll get a bunch of hits for 'custom', even if the word is quoted.

            Eric Dalgliesh added a comment -

            To everybody interested in this issue,

            We've added two new indexing language options to JIRA in 6.0.5 called "English - Minimal Stemming" and "English - Moderate Stemming". The minimal stemmer uses the s-stemming algorithm and only stems plurals ending in "s". The moderate stemmer uses the KStem algorithm, which consults a dictionary when it stems words to avoid conflating some words with others (for example, customer and customise). Moderate Stemming is the recommended choice for the English language, and new installations of JIRA will use this indexing option by default. The existing algorithm has been renamed to "English - Aggressive Stemming". Existing installations will continue to use this stemmer until an alternative is manually specified and a reindex is performed (a background reindex will work here).

            JIRA 6.1 instances still using the Aggressive setting will have the backing algorithm automatically upgraded from the Porter algorithm to the slightly more advanced Snowball algorithm, which many of the non-English languages have already been using.

            On a related note, stemming is a tricky business and has different requirements in different scenarios. For illustrative purposes, most instances want to treat custom and customise as the same root word (a word similar to bespoke), while a small number of instances might require that custom refer to culture and so want to treat customise as a different word. Due to edge cases like this, we will never have a perfect "out of the box" solution that works for everyone. We've created a feature request at JRA-33911 so you can express interest if you find yourself requiring the ability to customise which words are stemmed. JRA-33911 should also serve as a good place to discuss and vote on that.

            Happy searching,
            Eric
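
The difference between the stemming levels described in the comment above can be sketched in a few lines of Python. This is an editorial illustration only: `s_stem` follows a simplified reading of the S-stemmer plural rules as they are commonly described, and `toy_aggressive_stem` with its `TOY_SUFFIXES` list is a hypothetical stand-in for an aggressive suffix stripper. Neither is the Porter, Snowball, or KStem algorithm, and neither is Jira's or Lucene's actual code.

```python
def s_stem(word: str) -> str:
    """Strip only plural endings (simplified S-stemmer rules)."""
    w = word.lower()
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"      # ponies -> pony
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-1]            # issues -> issue
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]            # customers -> customer
    return w


# Hypothetical suffix list, chosen only to reproduce the conflation
# reported in this issue. It is NOT the Porter or Snowball rule set.
TOY_SUFFIXES = ("ise", "ize", "er")


def toy_aggressive_stem(word: str) -> str:
    """Strip plurals, then one derivational suffix if enough stem remains."""
    w = s_stem(word)
    for suffix in TOY_SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 4:
            return w[: -len(suffix)]
    return w


if __name__ == "__main__":
    for word in ("custom", "customer", "customers", "customise"):
        print(word, s_stem(word), toy_aggressive_stem(word))
```

Under the toy aggressive stemmer all four words collapse to the same index term, so a search for one returns hits for the others. Under the plural-only stemmer, "customer" and "customers" share a term while "custom" and "customise" stay distinct.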

            Eric Dalgliesh added a comment -

            We've begun to investigate this issue but it's a big task, so I can't make any promises about delivery dates (yet). At this early stage it is still possible that we will be unable to find a reasonable solution that we can deliver in a 6.0.x timeframe. I say this because I don't want to get anybody's hopes up; we really are in the early stages of investigation.

            Please note that this is not an umbrella issue. The only thing we will be addressing under this issue is the stemming problem (for example, "customise" would no longer match "customer"). Wildcard matching, while similar on the surface, is a fundamentally separate issue and is covered by JRA-6187. Likewise, underscore being treated as whitespace is covered by JRA-14641 and JRA-32441. There are a bunch of other issues that are similar to this on the surface but fundamentally different, so I won't list them all. Again, I don't want to get people's hopes up that more will be investigated under this issue than just what this issue describes.

            Reynard Claassen added a comment -

            I'm surprised that this issue is not receiving more attention from Atlassian.

            This is a crippling defect in the company I work for.

            Had I been responsible for choosing the replacement for my company's old wiki, this defect alone would have had me crossing Confluence off my shortlist, especially as this open ticket is going on 7 years now.

            Marc Trudeau added a comment -

            Just turned off the stemming in our installation because, for example, "customer" returned "customize" and "custom". That made it really difficult to find duplicate bugs.

            As we expand visibility into our system to a wider and wider corporate audience, I fear the need to make wildcard searches explicit, with correct syntax, is going to become a usability problem. Making stemming better, or making "stemming on vs. off" a per-user or per-use setting rather than a global one, would be very helpful.

            G B added a comment -

            Thanks for the update, though that didn't answer my question of what "this" means.


            Peter Leschev added a comment -

            "Can Atlassian state which one they intend to work on in this issue that is currently slated for JIRA 4.3?"

            Apologies, this is not going to make 4.3. I've removed '4.3' from the fix version to avoid confusion. This is something I'd like to tackle in 4.3.x but I'm hesitant to make any promises.

            Cheers,
            Peter

            G B added a comment -

            There are two problems described in this ticket.

            Problem 1) Liberal stemming behavior sometimes results in unexpected matches

            Problem 2) Users can't match an underscore in a search term or can't do a partial word search on phrases (two words joined by an underscore) due to Lucene behavior. (JRA-14641 is dedicated to this problem.)

            Can Atlassian state which one they intend to work on in this issue that is currently slated for JIRA 4.3?

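
Problem 2 is a tokenization question, not a stemming one, and a small Python sketch can show why either choice has a cost. The two tokenizers and the `matches` helper below are hypothetical illustrations and are not Lucene's analyzers. One splits on underscores, so the underscore itself can never be matched; the other keeps underscores inside tokens, so a search for one half of the joined phrase finds nothing.

```python
import re


def tokenize_split_underscore(text: str) -> list:
    """Treat '_' like whitespace: 'max_connections' -> ['max', 'connections']."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tokenize_keep_underscore(text: str) -> list:
    """Keep '_' inside tokens: 'max_connections' -> ['max_connections']."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def matches(query: str, text: str, tokenize) -> bool:
    """True if every token of the query appears among the tokens of the text."""
    text_tokens = set(tokenize(text))
    return all(token in text_tokens for token in tokenize(query))


WITH_UNDERSCORE = "Raise max_connections before the release"
WITHOUT_UNDERSCORE = "The max number of connections was reached"

if __name__ == "__main__":
    for tokenize in (tokenize_split_underscore, tokenize_keep_underscore):
        print(tokenize.__name__)
        print("  max_connections vs underscore text:",
              matches("max_connections", WITH_UNDERSCORE, tokenize))
        print("  max_connections vs plain text:",
              matches("max_connections", WITHOUT_UNDERSCORE, tokenize))
        print("  connections vs underscore text:",
              matches("connections", WITH_UNDERSCORE, tokenize))
```

With the splitting tokenizer, a query for "max_connections" also matches text that merely contains "max" and "connections" as separate words. With the keeping tokenizer that false positive disappears, but a query for "connections" no longer finds "max_connections".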

            Ryan McCollum added a comment -

            We have a wealth of knowledge on our Confluence system that contains underscores. This knowledge is not easily accessible using search. Like ourselves, the majority of tech companies will have a myriad of terms containing underscores that are almost impossible to locate via search.

            bain added a comment -

            Just wanted to also point out that it would be nice if it were possible to find URLs using prefix queries. Check out JRA-17463.


            ɹǝʞɐq pɐɹq added a comment -

            @Greg Miller

            Our current workaround is to set the Lucene Indexer Language to Other in the General Configuration section. This disables stemming entirely. However, it has a cost: a search for "cat" won't find "cats". Please be aware of this.
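
The trade-off in this workaround can be seen with a tiny inverted index. The sketch below is an editorial illustration in Python; `build_index`, `search`, the two normalizers, and the sample documents are all hypothetical and do not reflect how Jira or Lucene index issues. The same normalizer is applied at index time and at query time, which is also why changing the indexing language requires a reindex.

```python
from collections import defaultdict
from typing import Callable, Dict, Set


def no_stemming(token: str) -> str:
    """Index tokens exactly as written (analogous to choosing 'Other')."""
    return token


def strip_plural_s(token: str) -> str:
    """Toy plural stripper: drop a trailing 's' unless the word ends in 'ss'."""
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def build_index(docs: Dict[str, str],
                normalize: Callable[[str], str]) -> Dict[str, Set[str]]:
    """Map each normalized token to the ids of the documents containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[normalize(token)].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], term: str,
           normalize: Callable[[str], str]) -> Set[str]:
    """Look up a single term, normalized the same way as the index."""
    return set(index.get(normalize(term.lower()), set()))


# Hypothetical sample data.
DOCS = {
    "DOC-1": "cats chase mice",
    "DOC-2": "customer reported a crash",
    "DOC-3": "custom field broken",
}

if __name__ == "__main__":
    for normalize in (no_stemming, strip_plural_s):
        index = build_index(DOCS, normalize)
        print(normalize.__name__,
              "cat ->", search(index, "cat", normalize),
              "customer ->", search(index, "customer", normalize))
```

With no stemming, "customer" matches only the document that contains it, but "cat" finds nothing. With the plural stripper, "cat" finds "cats" and "customer" still does not match "custom".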

              ohernandez@atlassian.com Oswaldo Hernandez (Inactive)
              7ee5c68a815f Jeff Turner