[JRASERVER-5567] Incorrect stemming causes some words to be unsearchable

Type: Bug
Resolution: Fixed
Priority: Medium
Fix Version/s: 6.0.5
Affects Version/s: 3.0.3, 5.2.7
Component/s: JQL
Labels:

Introduced in Version: 3
Support reference count: 3

For instance, try to search for an issue containing the word 'customer'. You'll get a bunch of hits for 'custom', even if the word is quoted.

has a derivative of
JRASERVER-33739 Stemming options for indexing in the english language (Closed)
JRASERVER-33911 List of words to exclude from stemming during indexing (Closed)

is duplicated by
JRASERVER-9240 Searching exact word matches should not ignore "common" words (Closed)
JRASERVER-15006 Text-Search using Wildcards and German Umlauts does not work (Closed)
JRASERVER-10887 Searching for the term "HTTPS" returns false positives. (Closed)

is related to
JRASERVER-6187 wildcard search fails to find matches (Closed)
JRASERVER-12947 Wildcard searching does not work on long english text (Closed)
JRASERVER-14641 Impossible to distinguish between a space and an underscore in a search query (Closed)
JRASERVER-19211 Changing the Indexing language does not inform the user that they must do a re-index. (Closed)
JRASERVER-14574 Searching on Text Field custom field does not return the expected result (Gathering Impact)
JRASERVER-13441 Provide option for partial searches in hyphen-separated numbers (Closed)
JRASERVER-14712 Cannot search JIRA issue summaries containing mixed English and Japanese characters (Closed)
JRASERVER-15087 Search, Quick Search doesn't find characters within a word (Closed)

relates to
CONFSERVER-10856 Corrupt search with Umlaute (Closed)
JRASERVER-32054 Apostrophe is not a word separator (Closed)
JRASERVER-13672 Better searching when stemming is in place. Improve Lucene QueryParser to perform analysis on prefixed queries. (Closed)
JRASERVER-17463 Better exact-text searching (Gathering Interest)


Eric Dalgliesh added a comment - 24/Jul/2013 1:56 AM

To everybody interested in this issue,

We've added two new indexing language options to JIRA in 6.0.5 called "English - Minimal Stemming" and "English - Moderate Stemming". The minimal stemmer uses the s-stemming algorithm and only stems plurals ending in "s". The moderate stemmer uses the KStem algorithm and uses a dictionary when it stems words to avoid conflating some words with others (for example, customer and customise). Moderate Stemming is the recommended choice for the English language and new installations of JIRA will use this indexing option by default. The existing algorithm has been renamed to "English - Aggressive Stemming" - existing installations will continue to use this stemmer until an alternative is manually specified and a reindex performed (a background reindex will work here).

JIRA 6.1 instances still using the Aggressive setting will have the backing algorithm for that setting automatically upgraded from the Porter algorithm to the slightly more advanced Snowball algorithm, which many of the non-English languages have already been using.

On a related note, stemming is a tricky business and has different requirements in different scenarios. For example, most instances want to treat custom and customise as the same root word (custom in the sense of bespoke), while a small number of instances might need custom to refer to culture and so want to treat customise as a different word. Because of edge cases like this, we will never have a perfect "out of the box" solution that works for everyone. We've created a feature request at JRA-33911 so that you can express interest if you find yourself needing the ability to customise which words are stemmed. JRA-33911 should also serve as a good place to discuss and vote on that.

Happy searching,
Eric
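
To illustrate the "Minimal Stemming" idea described in the comment above, here is a minimal Python sketch of the classic S-stemmer rules (plural suffixes only). It is not taken from JIRA or Lucene; the function name `s_stem` and the minimum-length guard are assumptions made for the example, and the real implementation may differ in its details.

```python
def s_stem(word: str) -> str:
    """Strip common English plural endings using S-stemmer style rules.

    Illustrative sketch only: the rule set follows the classic S-stemmer,
    and the length guard below is an assumption added for this example.
    """
    w = word.lower()
    if len(w) <= 3:
        # Assumption: leave very short words alone (e.g. "bus", "is").
        return w
    # Rule 1: "ies" -> "y", unless the word ends in "eies" or "aies".
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    # Rule 2: "es" -> "e", unless the word ends in "aes", "ees" or "oes".
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-2] + "e"
    # Rule 3: drop a final "s", unless the word ends in "us" or "ss".
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]
    return w


if __name__ == "__main__":
    for term in ["customers", "customer", "custom", "queries", "status"]:
        print(term, "->", s_stem(term))
```

Because only plural endings are touched, "customers" reduces to "customer" while "customer" and "custom" stay distinct, which is the behaviour this issue asked for.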

Eric Dalgliesh added a comment - 05/Jul/2013 12:27 AM

We've begun to investigate this issue but it's a big task, so I can't make any promises about delivery dates (yet). At this early stage it is still possible that we will be unable to find a reasonable solution that we can deliver in a 6.0.x timeframe. I say this because I don't want to get anybody's hopes up; we really are in the early stages of investigation.

Please note that this is not an umbrella issue. The only thing we will be addressing under this issue is the stemming problem (for example, "customise" would no longer match "customer"). In particular, wildcard matching, while similar on the surface, is a fundamentally separate issue and is covered by JRA-6187. Likewise, underscore being treated as whitespace is covered by JRA-14641 and JRA-32441. There are a bunch of other issues that are similar to this on the surface but fundamentally different, so I won't list them all. Again, I don't want to get people's hopes up that more will be investigated under this issue than just what this issue describes.

Reynard Claassen added a comment - 01/Nov/2011 5:23 PM

I'm surprised that this issue is not receiving more attention from Atlassian.

This is a crippling defect in the company I work for.

Had I been responsible for choosing the replacement for my company's old wiki, this defect alone would have me crossing Confluence off my shortlist.
Especially as this open ticket is going on 7 years now.

Marc Trudeau added a comment - 08/Sep/2011 1:25 PM

Just turned off the stemming in our installation because, for example, "customer" returned "customize" and "custom". Really difficult to find duplicate bugs.

As we expand visibility into our system to a wider and wider corporate audience, I fear the need to make wildcard searches explicit, with correct syntax, is going to become a usability problem. Making stemming better, or making "stemming on vs. off" a per-user or per-use setting rather than a global one, would be very helpful.

G B added a comment - 27/Jan/2011 9:34 PM

Thanks for the update, though that didn't answer my question of what "this" means.

Peter Leschev added a comment - 27/Jan/2011 4:54 AM

"Can Atlassian state which one they intend to work on in this issue that is currently slated for JIRA 4.3?"

Apologies, this is not going to make 4.3. I've removed '4.3' from the fix version to avoid confusion. This is something I'd like to tackle in 4.3.x but I'm hesitant to make any promises.

Cheers,
Peter

G B added a comment - 27/Jan/2011 12:59 AM

There are two problems described in this ticket.

Problem 1) Liberal stemming behavior sometimes results in unexpected matches

Problem 2) Users can't match an underscore in a search term or do a partial word search on phrases (two words joined by an underscore) due to Lucene behavior. (JRA-14641 is dedicated to this problem.)

Can Atlassian state which one they intend to work on in this issue that is currently slated for JIRA 4.3?

Ryan McCollum added a comment - 25/Feb/2010 9:11 AM

We have a wealth of knowledge on our Confluence system that contains underscores. This knowledge is not easily accessible using search. Like ourselves, the majority of tech companies will have a myriad of terms containing underscores that are almost impossible to locate via search.

bain added a comment - 18/Sep/2009 6:46 AM

Just wanted to also point out that it would be nice if it were possible to find URLs using prefix queries. Check out JRA-17463.

ɹǝʞɐq pɐɹq added a comment - 09/Sep/2009 12:32 AM

@Greg Miller

Our current workaround is to set the Lucene Indexer Language to Other in the General Configuration section. This disables stemming entirely. However, it has a cost: searches such as "cat" won't find "cats". Please be aware of this.
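
The trade-off in this workaround can be sketched with a toy inverted index in Python. Everything here is hypothetical and for illustration only: the `DEMO-` issue keys, the whitespace tokenizer, and the `strip_plural_s` normaliser are assumptions, not JIRA's or Lucene's actual analysis chain.

```python
from collections import defaultdict


def no_stemming(token: str) -> str:
    """Index tokens exactly as written (roughly the 'Other' setting)."""
    return token


def strip_plural_s(token: str) -> str:
    """Hypothetical plural-only normaliser: drop a trailing 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith(("ss", "us")):
        return token[:-1]
    return token


def build_index(docs, normalize):
    """Map each normalised token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[normalize(token)].add(doc_id)
    return index


def search(index, term, normalize):
    """Look up a single term, applying the same normaliser as at index time."""
    return sorted(index.get(normalize(term.lower()), set()))


# Made-up issue summaries keyed by made-up issue ids.
DOCS = {
    "DEMO-1": "the cat sat on the mat",
    "DEMO-2": "two cats chased a mouse",
}

exact_index = build_index(DOCS, no_stemming)
plural_index = build_index(DOCS, strip_plural_s)

if __name__ == "__main__":
    print(search(exact_index, "cat", no_stemming))      # only the exact match
    print(search(plural_index, "cat", strip_plural_s))  # singular and plural
```

With no stemming, a search for "cat" finds only the issue that contains exactly "cat"; with even a plural-only normaliser applied at both index and query time, both issues are found.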

Assignee: Oswaldo Hernandez (Inactive)
Reporter: Jeff Turner
Affected customers: 75
Watchers: 41
Created: 29/Dec/2004 4:49 AM
Updated: 03/Feb/2021 11:56 PM
Resolved: 23/Jul/2013 9:03 AM
