[JRASERVER-5567] Incorrect stemming causes some words to be unsearchable

Type: Bug
Resolution: Fixed
Priority: Medium
Fix Version/s: 6.0.5
Affects Version/s: 3.0.3, 5.2.7
Component/s: JQL
Labels:

Introduced in Version: 3
Support reference count: 3

For instance, try to search for an issue containing the word 'customer'. You'll get a bunch of hits for 'custom', even if the word is quoted.

has a derivative of

JRASERVER-33739 Stemming options for indexing in the english language (Closed)
JRASERVER-33911 List of words to exclude from stemming during indexing (Closed)

is duplicated by

JRASERVER-9240 Searching exact word matches should not ignore "common" words (Closed)
JRASERVER-15006 Text-Search using Wildcards and German Umlauts does not work (Closed)
JRASERVER-10887 Searching for the term "HTTPS" returns false positives. (Closed)

is related to

JRASERVER-6187 wildcard search fails to find matches (Closed)
JRASERVER-12947 Wildcard searching does not work on long english text (Closed)
JRASERVER-14641 Impossible to distinguish between a space and an underscore in a search query (Closed)
JRASERVER-19211 Changing the Indexing language does not inform the user that they must do a re-index. (Closed)
JRASERVER-14574 Searching on Text Field custom field does not return the expected result (Gathering Impact)
JRASERVER-13441 Provide option for partial searches in hyphen-separated numbers (Closed)
JRASERVER-14712 Cannot search JIRA issue summaries containing mixed English and Japanese characters (Closed)
JRASERVER-15087 Search, Quick Search doesn't find characters within a word (Closed)

relates to

CONFSERVER-10856 Corrupt search with Umlaute (Closed)
JRASERVER-32054 Apostrophe is not a word separator (Closed)
JRASERVER-13672 Better searching when stemming is in place. Improve Lucene QueryParser to perform analysis on prefixed queries. (Closed)
JRASERVER-17463 Better exact-text searching (Gathering Interest)


Eric Dalgliesh added a comment - 24/Jul/2013 1:56 AM

To everybody interested in this issue,

We've added two new indexing language options to JIRA in 6.0.5 called "English - Minimal Stemming" and "English - Moderate Stemming". The minimal stemmer uses the s-stemming algorithm and only stems plurals ending in "s". The moderate stemmer uses the KStem algorithm and uses a dictionary when it stems words to avoid conflating some words with others (for example, customer and customise). Moderate Stemming is the recommended choice for the English language and new installations of JIRA will use this indexing option by default. The existing algorithm has been renamed to "English - Aggressive Stemming" - existing installations will continue to use this stemmer until an alternative is manually specified and a reindex performed (a background reindex will work here).
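As a rough illustration of what a plural-only stemmer does, here is a simplified Python sketch of S-stemmer style rules. It is not JIRA's or Lucene's actual code: the function name `s_stem` and the exact exception lists are assumptions made for this example.

```python
def s_stem(word: str) -> str:
    """Simplified S-stemmer sketch: only strips plural endings in "s"."""
    w = word.lower()
    # Very short words and words not ending in "s" are left alone.
    if len(w) < 3 or not w.endswith("s"):
        return w
    # Endings such as "status" or "class" are not plurals.
    if w.endswith(("us", "ss")):
        return w
    # "queries" -> "query", but leave "-aies" / "-eies" untouched.
    if w.endswith("ies") and len(w) > 3 and not w.endswith(("aies", "eies")):
        return w[:-3] + "y"
    # Conservative exceptions: do not touch these "-es" style endings.
    if w.endswith(("aes", "ees", "oes", "ies")):
        return w
    # Default case: drop the trailing "s".
    return w[:-1]


if __name__ == "__main__":
    for term in ("customers", "customer", "customise", "custom", "cats"):
        print(term, "->", s_stem(term))
```

Under these rules "customers" reduces to "customer", while "customer", "customise" and "custom" stay distinct terms, which is the behaviour a plural-only stemmer is meant to give.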

Instances on JIRA 6.1 that still use the Aggressive setting will have its backing algorithm automatically upgraded from the Porter algorithm to the slightly more advanced Snowball algorithm, which many of the non-English languages have been using.

On a related note, stemming is a tricky business and has different requirements in different scenarios. For illustrative purposes, most instances want to treat custom and customise as the same root word (a word similar to bespoke), while a small number of instances might require that custom refer to culture and so want to treat customise as a different word. Due to edge cases like this, we will never have a perfect "out of the box" solution that works for everyone. We've created a feature request at JRA-33911 to allow you to express interest if you find yourself requiring the ability to customise which words are stemmed. JRA-33911 should also serve as a good place to discuss and vote on that.

Happy searching,
Eric


Eric Dalgliesh added a comment - 05/Jul/2013 12:27 AM

We've begun to investigate this issue but it's a big task, so I can't make any promises about delivery dates (yet). At this early stage it is still possible that we will be unable to find a reasonable solution that we can deliver in a 6.0.x timeframe. I say this because I don't want to get anybody's hopes up; we really are in the early stages of investigation.

Please note that this is not an umbrella issue. The only thing we will be addressing under this issue is the stemming problems (for example, "customise" would no longer match "customer"). That is, wildcard matching, while similar on the surface, is a fundamentally separate issue and covered by JRA-6187. Likewise, underscore being treated as whitespace is covered by JRA-14641 and JRA-32441. There are a bunch of other issues that are similar to this on the surface but fundamentally different, so I won't list them all. Again, I don't want to get people's hopes up that more will be investigated under this issue than just what this issue describes.


Reynard Claassen added a comment - 01/Nov/2011 5:23 PM

I'm surprised that this issue is not receiving more attention from Atlassian.

This is a crippling defect in the company I work for.

Had I been responsible for choosing the replacement for my company's old wiki, this defect alone would have me crossing Confluence off my shortlist.
Especially as this open ticket is going on 7 years now.


Marc Trudeau added a comment - 08/Sep/2011 1:25 PM

Just turned off the stemming in our installation because, for example, "customer" returned "customize" and "custom". Really difficult to find duplicate bugs.

As we expand visibility into our system to a wider and wider corporate audience, I fear the need to make wildcard searches explicit, with correct syntax, is going to become a usability problem. Making stemming better, or making "stemming on vs. off" a per-user or per-use setting rather than a global one, would be very helpful.


G B added a comment - 27/Jan/2011 9:34 PM

Thanks for the update, though that didn't answer my question of what "this" means.


Peter Leschev added a comment - 27/Jan/2011 4:54 AM

Can Atlassian state which one they intend to work on in this issue that is currently slated for JIRA 4.3?

Apologies, this is not going to make 4.3. I've removed '4.3' from the fix version to avoid confusion. This is something I'd like to tackle in 4.3.x but I'm hesitant to make any promises.

Cheers,
Peter


G B added a comment - 27/Jan/2011 12:59 AM

There are two problems described in this ticket.

Problem 1) Liberal stemming behavior sometimes results in unexpected matches

Problem 2) Users can't match an underscore in a search term or can't do a partial word search on phrases (two words joined by an underscore) due to Lucene behavior. (JRA-14641 is dedicated to this problem.)

Can Atlassian state which one they intend to work on in this issue that is currently slated for JIRA 4.3?


Ryan McCollum added a comment - 25/Feb/2010 9:11 AM

We have a wealth of knowledge on our Confluence system that contains underscores. This knowledge is not easily accessible using search. Like ourselves, the majority of tech companies will have a myriad of terms containing underscores that are almost impossible to locate via search.


bain added a comment - 18/Sep/2009 6:46 AM

Just wanted to also point out that it would be nice if it was possible to find URLs using prefix queries. Check out JRA-17463.


ɹǝʞɐq pɐɹq added a comment - 09/Sep/2009 12:32 AM

@Greg Miller

Our current workaround is to set the Lucene Indexer Language to Other in the General configuration section. This will not do any stemming. However, it has a cost: searches such as "cat" won't find "cats". Please be aware of this.


Greg Miller added a comment - 08/Sep/2009 10:34 PM

We are also having a problem with customers searching our Technical Support documents (using Confluence) if the search query contains an underscore. The issue is listed here:

http://jira.atlassian.com/browse/CONF-14554

We can't use wildcards when "_" is in the search term. We are getting a lot of customer complaints about this.

Is there some type of workaround?

-Greg


G B added a comment - 14/Mar/2008 6:11 PM

Tom, you appear to have hit the same problem that I just reported in JRA-14641. In the Lucene query format "CSS Report" and CSS_Report are indistinguishable. Further, both are considered search phrases and wildcards are not allowed in search phrases. Therefore it is impossible to do the search you want.

I have started setting up saved searches where I programmatically generate a list of all possible matching search results and insert that ridiculously long list into the "Query" field in the search.

e.g. Query: CSS_Report_v1.001 OR CSS_Report_v1.002 OR CSS_Report_v1.003 OR ... OR CSS_Report_v1.999

I have successfully tested this up to 10,000 enumerations. At 100,000 I think the textbox in Firefox broke, so there's a limit in there somewhere.

This workaround assumes that you can enumerate the list of things that you are looking for and that there are fewer than 10,000 of them. In most cases, we can't.
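For readers who want to try the enumeration workaround, a minimal Python sketch of generating such a query string follows. The helper name `build_or_query` and its parameters are hypothetical and not part of JIRA; the sketch only assumes that the variants differ by a zero-padded numeric suffix, as in the example above.

```python
def build_or_query(prefix: str, start: int, stop: int, width: int = 3) -> str:
    """Join every enumerated term (start..stop inclusive) with OR.

    `prefix` is the fixed part of the term, e.g. "CSS_Report_v1.";
    `width` is the zero-padded length of the numeric suffix.
    """
    terms = (f"{prefix}{n:0{width}d}" for n in range(start, stop + 1))
    return " OR ".join(terms)


if __name__ == "__main__":
    # Paste the printed string into the "Query" field of the search.
    print(build_or_query("CSS_Report_v1.", 1, 999))
```

The resulting string grows linearly with the number of variants, so it runs into the input-size limits mentioned above well before the list becomes exhaustive.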


Tom Clarkson added a comment - 21/Feb/2008 2:46 PM - edited

This is a serious problem (Enterprise 3.12.1) - results missing, false positives... users can't find what they need to:

I created a test project to investigate, and created 5 issues with variations on name:

CSS_Report v1.234
CSS_Reporting_v1.234
CSS_Reports_v1.234
CSS Report v1.234
CSS-Report v1.234

(note the variations in underscore, dash and space in the name)

Search Term	Expected results	Jira's results
CSS*	All 5	All 5
"CSS"	Not sure (I'd expect just "CSS Report v1.234")	All 5
"CSS "	Just "CSS Report v1.234"	All 5
"CSS_"	Just the three with underscores	All 5
CSS_*	Just the three with underscores	None
CSS_Report	One (CSS_Report v1.234)	CSS_Report v1.234; CSS Report v1.234 (?); CSS-Report v1.234 (?)
CSS_Report*	Just the three with underscores	None
"CSS_Report*"	Just the three with underscores	CSS_Report v1.234; CSS Report v1.234 (?); CSS-Report v1.234

I have not yet found a search which will return just the names which start with "CSS_Report"... (ie the typical results expected from a search such as CSS_Report*)

Can anyone shed any light on how to return these results?


Neil Arrowsmith added a comment - 16/Jan/2008 2:41 PM

Added by vote as one of our users just complained that they can't search for the word "stepped" without getting a load of results matching "step", "steps" etc., which for us is quite a big deal.

I can try out the workaround, but presumably changing the index language will then remove the ability for people to do explicit fuzzy searches and return "stepping", "steps", "stepped" etc. with a search for "step~"?

Which raises a question for me - what's the point of the fuzzy search mechanism if all searches are fuzzy anyway?


John M. Black added a comment - 27/Feb/2007 7:34 PM

Sorry, I posted my comments in the "duplicate" (9240) by mistake.....
============

This problem is more than just stemming, so as a matter of principle I don't want a workaround that only addresses the stemming/ignored words problem.

The root of the problem is that JIRA/Lucene/whoever does not respect the user's attempt to enter an exact phrase. Any decent text search engine, IMHO, should:

1) allow users to switch to an "exact phrase" mode, via some preference-toggle or widget
--or, better yet,
2) allow users to surround the search terms (or any part of them) in quotes; and when encountering the quotes, treat that portion as an "untouchable" character sequence without any splitting or stemming. (This is so basic! I think by now, most web power users will instinctively try quotes for phrase searching.)

If we can do this, then all of the other symptoms in these discussions (and they are all symptoms of the same cause) will either disappear or have a highly-usable workaround. Don't want stemming? Don't want to drop words? Need to search an exact phrase? Just surround it in quotes.


Mr Automation Guy added a comment - 28/Sep/2006 6:01 PM

Thanks Neal. You guys are the best. This is a "CLOSED" issue for me.


Neal Applebaum added a comment - 27/Sep/2006 11:57 AM

According to the documentation:

Note: All query terms in JIRA are case insensitive.


Mr AutomationGuy added a comment - 26/Sep/2006 9:48 PM

Neal, I apologize, IT WORKED; I might have done something wrong. However, looking for '[PRODUCTION]' returns 'Production' as well. Is there any way to search case-sensitively and/or look for [PRODUCTION] only? I thought looking for "[PRODUCTION]" should only return [PRODUCTION].

~Omid


Mr AutomationGuy added a comment - 26/Sep/2006 9:31 PM

Hi Neal:

To eliminate doubts I'm going to start over, do the entire process, take a snapshot and place it here.

Here we go.............

~Omid


Neal Applebaum added a comment - 26/Sep/2006 6:50 PM

I followed Atlassian's instructions, and it worked for me (standalone) just fine. The stemming problem went away when Indexing was set to Other. Omid - your comment was a little cryptic (e.g. was the typo unintentional, how many rows were returned when searching for Production in English vs. in Other setting)?

In my test, I did a search for issues with "product" in issue summary, and it found 4 issues, including hits on "product", "products", "production". When I re-indexed with "Other" as indexing language, the search found only 2 - the 2 with exactly "product". Only when I searched on "product*" did it find all 4.
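The difference described in this test can be mimicked with a toy Python sketch. Everything in it is an assumption for illustration: `crude_stem` is a deliberately crude stand-in for an aggressive stemmer (it is not the Porter algorithm), the sample summaries are invented, and real Lucene analysis and query parsing are considerably more involved.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def crude_stem(token):
    """Hypothetical stand-in for an aggressive stemmer (NOT Porter)."""
    for suffix in ("ion", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def search(summaries, query, analyzer=None):
    """Return summaries matching the query.

    A trailing "*" requests a prefix (wildcard) match on raw tokens;
    otherwise the query term and the tokens pass through `analyzer`
    (identity when None, which models the "Other" indexing language).
    """
    analyze = analyzer or (lambda t: t)
    is_prefix = query.endswith("*")
    term = query.rstrip("*").lower()
    hits = []
    for summary in summaries:
        tokens = tokenize(summary)
        if is_prefix:
            matched = any(t.startswith(term) for t in tokens)
        else:
            matched = analyze(term) in {analyze(t) for t in tokens}
        if matched:
            hits.append(summary)
    return hits


SUMMARIES = [
    "Broken product page",
    "Product import fails",
    "List all products",
    "Production outage",
]

if __name__ == "__main__":
    print(len(search(SUMMARIES, "product", analyzer=crude_stem)))
    print(len(search(SUMMARIES, "product")))
    print(len(search(SUMMARIES, "product*")))
```

With the stand-in stemmer the plain query matches all four summaries, without it only the two containing exactly "product", and the explicit wildcard query again matches all four.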

Are you sure the search didn't find some issues because the search included more fields (e.g. description, comments) where the word was also found?


Mr AutomationGuy added a comment - 26/Sep/2006 5:55 PM

Correct Nick:

These are the steps I have performed:

Steps to reproduce:
1) Changed the "Indexing Language" in the General configuration to "Other" from "English"
2) Re-index
3) Restart the service
4) Search for "[PRODUCTION]"

Result: returned query:
[PRODUCITON] ....
Major Product

~Omid


Nick Menere [Atlassian] (Inactive) added a comment - 26/Sep/2006 1:15 AM

Hi guys,
If you set the indexing language to "Other", Lucene will no longer stem words and you will have exact results after a reindex.

Omid,
Are you saying this is not the case?


Neal Applebaum added a comment - 25/Sep/2006 1:21 PM

I think Atlassian needs to respond to us about this issue. It doesn't matter if the searching is fast if it is inaccurate, IMHO. According to Atlassian, the stemming problem should go away when indexing is not set to English.


Mr Automation Guy added a comment - 24/Sep/2006 4:47 PM

Neal:

I did in fact re-index, and to make sure I bounced the server too; the result is the same. I switched it back to "English" now. Also, I only have "English" as my choice. Do I need to install "Lucene Indexer Language", or is "English" fine?

David:

After following the instructions above we are still not getting the kind of search we are hoping to get. Searching for "[PRODUCTION]" still returns "Major Product" as one of the results. It seems strange to me that it finds "Product" when I look for PRODUCTION, hmm.

~Omid


David Zawalski added a comment - 24/Sep/2006 12:09 PM

Hello folks,

The organization I work for is also experiencing great pain with this searching issue. I read through this thread of messages but could not tell if switching to "other" and re-indexing provided relief from the text searching issue. I would appreciate it if someone could comment on whether or not this works.

Thanks in advance,
Dave


Neal Applebaum added a comment - 22/Sep/2006 5:16 PM

Silly question, but ... after changing the indexing language from English to Other ... did you re-index?


Mr AutomationGuy added a comment - 22/Sep/2006 4:11 PM

This is still NOT WORKING for me.

The Search Indexing language in my version is set to English, so when I changed it to "other" I couldn't even find a simple text. I couldn't even find any issue when I used a filter.

I switched it back to English, I still have the old issue, searched for [PRODUCTION] and found "Product", "This is a production..".

~Omid


Melissa added a comment - 21/Sep/2006 8:55 PM

My environment also has this issue. In fact, just today our sr QA guy told me he had lost confidence in JIRA's search results, and that without being able to accurately find the issues he needs, the rest of the functionality was meaningless. I figured others must be having some problems too, and was able to find this thread, although it was fairly time consuming and I had to hunt through a lot of other issues. Before today, I had no idea about Lucene and the stemming. It seems like it would be preferable to have installations default to 'other' so that stemming is off. Stemming is counterintuitive to folks who are used to doing text searches, and while I can see it might be beneficial for broadening searches, wildcards handle that more effectively and accurately anyway. Accurate and precise results are much more important than broad results.


Mr AutomationGuy added a comment - 17/Aug/2006 4:46 PM

I 2nd that. I am still waiting for some sort of solution. lol

Bernard Durfee added a comment - 17/Aug/2006 1:05 PM

Any ETA on this? It was created 28/Dec/04... too bad bugs don't die of old age!

AntonA added a comment - 28/Apr/2006 10:44 AM

Ted,

You are right, you will need to reindex.

If stemming is "disabled" (i.e. not done) then search will only ever find exact matches, unless you use wildcards, etc. In the case of wildcards I believe ranking will be done.

Please let us know how you go.

Thanks,
Anton
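Editorial sketch of why the reindex matters: the terms already stored in the index were produced by the old analysis settings, so a query analyzed with the new settings may no longer line up with them. The code below is a toy model under stated assumptions, not Jira or Lucene code; `analyze`, `build_index`, `search`, the "er" suffix rule, and the DEMO issue keys are all hypothetical.

```python
def analyze(text: str, stem: bool) -> list[str]:
    """Lowercase and split on whitespace; optionally apply a toy 'er' strip."""
    tokens = text.lower().split()
    if stem:
        # Hypothetical stand-in for an English stemmer.
        tokens = [t[:-2] if t.endswith("er") and len(t) > 5 else t for t in tokens]
    return tokens


def build_index(docs: dict[str, str], stem: bool) -> dict[str, set[str]]:
    """Map each analyzed term to the issue keys that contain it."""
    index: dict[str, set[str]] = {}
    for key, text in docs.items():
        for term in analyze(text, stem):
            index.setdefault(term, set()).add(key)
    return index


def search(index: dict[str, set[str]], word: str, stem: bool) -> set[str]:
    """Analyze the query the same way as the current setting, then look it up."""
    hits: set[str] = set()
    for term in analyze(word, stem):
        hits |= index.get(term, set())
    return hits


docs = {"DEMO-1": "customer report", "DEMO-2": "custom field"}

old_index = build_index(docs, stem=True)             # built while stemming was on
stale = search(old_index, "customer", stem=False)    # setting changed, no reindex
new_index = build_index(docs, stem=False)            # after a full reindex
fresh = search(new_index, "customer", stem=False)
```

In this model the stale index only holds the term "custom", so the unstemmed query "customer" finds nothing until the index is rebuilt.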

Ted Pietrzak added a comment - 27/Apr/2006 2:35 PM

We're going to set the indexing language to {other} this weekend during our next maintenance window. I presume that we will have to reindex afterwards to realize the benefit of this.

I'll have a look at the dropped word business afterwards as well. I looked through a Lucene implementation and my recollection (admittedly foggy) was that the stop words were handled outside the stemming algorithm but I am prepared to be pleasantly surprised!

With respect to returning higher ranked results first, if we turn off stemming, it should be possible to do that, right?

Ted
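Editorial sketch of the distinction Ted recalls: stop-word removal and stemming can be modelled as two independent stages of a text-analysis pipeline, so either can be switched off without the other. This is a toy illustration only. The stop-word sample, the "er" suffix rule, and every function name are hypothetical, and whether Jira's indexing-language setting toggles both stages together is not established in this thread.

```python
STOP_WORDS = {"a", "the", "is", "of"}  # hypothetical sample, not Jira's list


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace."""
    return text.lower().split()


def drop_stop_words(tokens: list[str]) -> list[str]:
    """Stage 1: discard tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(tokens: list[str]) -> list[str]:
    """Stage 2: strip a trailing 'er' from longer words (toy rule)."""
    return [t[:-2] if t.endswith("er") and len(t) > 5 else t for t in tokens]


def analyze(text: str, remove_stop_words: bool, stem: bool) -> list[str]:
    """Run the stages that are enabled, in order."""
    tokens = tokenize(text)
    if remove_stop_words:
        tokens = drop_stop_words(tokens)
    if stem:
        tokens = toy_stem(tokens)
    return tokens


text = "The customer is a tester"
only_stop = analyze(text, remove_stop_words=True, stem=False)
only_stem = analyze(text, remove_stop_words=False, stem=True)
```

With only the stop-word stage enabled, "customer" survives intact; with only the stemming stage enabled, "the", "is" and "a" stay in the token stream.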

AntonA added a comment - 27/Apr/2006 12:39 AM

Hi Ted,

By dropped words I assume you are referring to stop words mentioned on this page:
http://confluence.atlassian.com/display/JIRA/Words+ignored+when+searching

I believe if the indexing language is set to {Other} all of the words will be indexed and no stemming will be performed. Have you tried using this configuration? Does it make searching meet your requirements?

We really appreciate your feedback, and we will definitely take it into consideration. As Nick mentioned, removing Lucene would be quite costly at the moment. But maybe we can improve it in the future. One way I can see is to make exact matches score higher in the results. Lucene does rank results by relevance. The problem is that once the words are stemmed, Lucene does not know the difference between 'customer' and 'custom' and hence cannot rank them properly. Unfortunately, at the moment I cannot make any promises about the implementation date for these improvements.

Please let us know if setting the indexing language to Other helps.

Thanks,
Anton
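Editorial sketch of the point about stemming and ranking: once words are reduced to a shared stem at index time, the index holds a single term for both, so an exact match can no longer be told apart from a stemmed one. The suffix stripper below is a toy written for illustration, not Lucene's English stemmer; `SUFFIXES`, `toy_stem`, `build_index`, and the DEMO issue keys are hypothetical.

```python
# Toy suffix stripper for illustration only; NOT Lucene's English stemmer.
SUFFIXES = ("ion", "er", "s")  # hypothetical rule set


def toy_stem(word: str) -> str:
    """Lowercase the word and strip the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least four characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each stemmed term to the set of issue keys containing it."""
    index: dict[str, set[str]] = {}
    for key, text in docs.items():
        for token in text.split():
            index.setdefault(toy_stem(token), set()).add(key)
    return index


docs = {"DEMO-1": "customer report", "DEMO-2": "custom field"}
index = build_index(docs)

# The query is stemmed the same way, so it lands on the shared term "custom".
hits = index.get(toy_stem("customer"), set())
```

Both issues come back for "customer", and nothing stored in the index records which of them contained the longer word.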

Ted Pietrzak added a comment - 26/Apr/2006 3:08 PM

Nick,
I appreciate the thought on stemming but it is only the tip of the iceberg with respect to Lucene. Then we have the issue of words that are dropped. And enforced case insensitivity and fragility of the indices.

If Lucene is too deeply embedded in what you're doing, so be it. Give us a way to bypass it and use a more precise search mechanism. If I could just get to the varchar fields directly, that would be a huge improvement. So many of our problems could be addressed if there were a mechanism I could use to access the database engine's built-in capabilities rather than being lumbered with Lucene.

Sorry to keep beating this horse but this is a daily source of pain. I get a call or an e-mail at least once a day with some new, bizarre search result. The good news is that I've been rather forcibly educated on how Lucene works. The bad news is that the answer is all too often "Sorry, can't help you with that."

Ted

Nick Menere [Atlassian] (Inactive) added a comment - 26/Apr/2006 2:40 AM

Ted,

I don't think dropping Lucene is practical for us as it is fundamental to a lot of the things we do within Jira.

To get the results you guys are after, change your indexing language to {{Other}} and you will get the results you are expecting.
Other does not do any stemming at all (though I believe it still gets rid of special characters).

Cheers,
Nick

Neal Applebaum added a comment - 25/Apr/2006 7:56 PM

Amen!

Ted Pietrzak added a comment - 25/Apr/2006 1:39 PM

All the more reason to ditch Lucene! The results of queries are all too often seemingly inexplicable to the end user community. If you have to spend 10 minutes explaining why they got some goofy result and then tell them that there is no way to get the correct result, then something is wrong.

Lucene style fuzzy matches may be jim dandy for web search engines but that's not what we're dealing with in a defect tracking system. People want precise search results. They generally have a pretty good idea of what they're looking for. If the correct results are obscured under the weight of 100's of incorrect results, that's not useful. Lucene doesn't even offer the mitigation of "weighing" the results of a search and putting the "heaviest" results first in the list.

So, it's clear that the real benefit of Lucene is that it offers a database-engine-agnostic way of searching textual fields. I would cheerfully trade universal database support for a product that supports 2-3 top-notch databases and gives the right answer when I do searches.

Nick Menere [Atlassian] (Inactive) added a comment - 21/Apr/2006 7:22 AM

At the moment the only way to do this is to change the Lucene Indexer Language to Other in the General configuration section. This will not do any stemming. Stemming is a very tricky thing and I don't think anyone will ever create a perfect English stemmer.

Neal Applebaum added a comment - 20/Apr/2006 2:47 PM

See http://jira.atlassian.com/browse/JRA-6187#action_52146

Scott, how it should work is if I search on "custom" it finds only "custom". If I search on "custom*" it will find customer as well. It's not just annoying, it's wrong.
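Editorial sketch of the behaviour Neal describes, assuming an index that stores unstemmed terms: a plain query matches only the identical term, and matches widen only when the user adds a wildcard. This uses Python's standard `fnmatch` glob matching as a stand-in for wildcard queries; it is not Lucene's query parser, and `search`, `terms_by_issue`, and the DEMO issue keys are hypothetical.

```python
from fnmatch import fnmatchcase


def search(terms_by_issue: dict[str, set[str]], query: str) -> set[str]:
    """Return issue keys with at least one stored term matching the query.

    Without a wildcard, fnmatchcase only matches the identical term.
    A trailing '*' turns the query into a prefix match.
    """
    query = query.lower()
    return {
        key
        for key, terms in terms_by_issue.items()
        if any(fnmatchcase(term, query) for term in terms)
    }


# Unstemmed terms, as they would be stored with stemming switched off.
terms_by_issue = {
    "DEMO-1": {"customer", "report"},
    "DEMO-2": {"custom", "field"},
}

exact = search(terms_by_issue, "custom")
prefix = search(terms_by_issue, "custom*")
```

Here `exact` contains only DEMO-2, while `prefix` contains both issues, which is the split between precise and broadened search that the comment asks for.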

Mr Automation Guy added a comment - 19/Apr/2006 11:05 PM

We also have this issue and we have 1300 issues which have the word [PRODUCTION] and when we search we get all the ones that have "PRODUCT". It is very annoying.

Scott Farquhar added a comment - 12/Apr/2005 10:05 AM

How would you want this to work? I think that it should match both, although perhaps 'customer' matches should be higher?

Assignee: Oswaldo Hernandez (Inactive)

Reporter: Jeff Turner

Affected customers: 75

Watchers: 41

Created: 29/Dec/2004 4:49 AM

Updated: 03/Feb/2021 11:56 PM

Resolved: 23/Jul/2013 9:03 AM
