[JRASERVER-5567] Incorrect stemming causes some words to be unsearchable

      For instance, try to search for an issue containing the word 'customer'. You'll get a bunch of hits for 'custom', even if the word is quoted.

            Eric Dalgliesh added a comment -

            To everybody interested in this issue,

            We've added two new indexing language options to JIRA in 6.0.5 called "English - Minimal Stemming" and "English - Moderate Stemming". The minimal stemmer uses the s-stemming algorithm and only stems plurals ending in "s". The moderate stemmer uses the KStem algorithm and uses a dictionary when it stems words to avoid conflating some words with others (for example, customer and customise). Moderate Stemming is the recommended choice for the English language and new installations of JIRA will use this indexing option by default. The existing algorithm has been renamed to "English - Aggressive Stemming" - existing installations will continue to use this stemmer until an alternative is manually specified and a reindex performed (a background reindex will work here).

            Installations on JIRA 6.1 that still use the Aggressive setting will have the backing algorithm automatically upgraded from the Porter algorithm to the slightly more advanced Snowball algorithm, which many of the non-English languages have been using.

            On a related note, stemming is a tricky business and has different requirements in different scenarios. For illustrative purposes, most instances want to treat custom and customise as the same root word (a word similar to bespoke) while a small number of instances might have requirements that custom should refer to culture and so want to treat customise as a different word. Due to edge cases like this, we will never have a perfect "out of the box" solution for this that works for everyone. We've created a feature request at JRA-33911 to allow you to express interest if you find yourself requiring the ability to customise which words are stemmed. JRA-33911 should also serve as a good place to discuss and vote on that.

            Happy searching,
            Eric
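
To make the "minimal" option more concrete, here is a small Python sketch of the classic S-stemmer rules, which only strip plural endings. It is an illustration under stated assumptions, not JIRA's or Lucene's actual code: the function name `s_stem` and the short-word guard are hypothetical choices made for this example.

```python
def s_stem(word: str) -> str:
    """Strip only plural endings, in the spirit of the classic S-stemmer rules."""
    w = word.lower()
    if len(w) < 3:  # assumption for this sketch: leave very short tokens alone
        return w
    # "ponies" -> "pony", but leave "-eies" / "-aies" endings to the later rules
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    # "-es" endings lose only the final "s", except after a, e or o
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-1]
    # plain plural "s", but not "-us" or "-ss" ("class" stays "class")
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]
    return w


for word in ("customers", "customer", "custom", "ponies", "class"):
    print(word, "->", s_stem(word))
```

Because only plural endings are touched, "customer" and "custom" remain distinct index terms, which is the behaviour this issue asks for.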

            Eric Dalgliesh added a comment -

            We've begun to investigate this issue but it's a big task, so I can't make any promises about delivery dates (yet). At this early stage it is still possible that we will be unable to find a reasonable solution that we can deliver in a 6.0.x timeframe. I say this because I don't want to get anybody's hopes up; we really are in the early stages of investigation.

            Please note that this is not an umbrella issue. The only thing we will be addressing under this issue is the stemming problems (for example, "customise" would no longer match "customer"). That is, wildcard matching, while similar on the surface, is a fundamentally separate issue and covered by JRA-6187. Likewise, underscore being treated as whitespace is covered by JRA-14641 and JRA-32441. There are a bunch of other issues that are similar to this on the surface but fundamentally different, so I won't list them all. Again, I don't want to get people's hopes up that more will be investigated under this issue than just what this issue describes.

            Reynard Claassen added a comment -

            I'm surprised that this issue is not receiving more attention from Atlassian.

            This is a crippling defect in the company I work for.

            Had I been responsible for choosing the replacement for my company's old wiki, this defect alone would have me crossing Confluence off my shortlist.
            Especially as this open ticket is going on 7 years now.

            Marc Trudeau added a comment -

            Just turned off the stemming in our installation because, for example, "customer" returned "customize" and "custom". Really difficult to find duplicate bugs.

            As we expand visibility into our system to a wider and wider corporate audience, I fear the need to make wildcard searches explicit, with correct syntax, is going to become a usability problem. Making stemming better, or making "stemming on vs. off" a per-user or per-use setting rather than a global one, would be very helpful.

            G B added a comment -

            Thanks for the update, though that didn't answer my question of what "this" means.

            Peter Leschev added a comment -

            "Can Atlassian state which one they intend to work on in this issue that is currently slated for JIRA 4.3?"

            Apologies, this is not going to make 4.3. I've removed '4.3' from the fix version to avoid confusion. This is something I'd like to tackle in 4.3.x but I'm hesitant to make any promises.

            Cheers,
            Peter

            G B added a comment -

            There are two problems described in this ticket.

            Problem 1) Liberal stemming behavior sometimes results in unexpected matches

            Problem 2) Users can't match an underscore in a search term or can't do a partial word search on phrases (two words joined by an underscore) due to Lucene behavior. (JRA-14641 is dedicated to this problem.)

            Can Atlassian state which one they intend to work on in this issue that is currently slated for JIRA 4.3?

            Ryan McCollum added a comment -

            We have a wealth of knowledge on our Confluence system that contains underscores. This knowledge is not easily accessible using search. Like ourselves, the majority of tech companies will have a myriad of terms containing underscores that are almost impossible to locate via search.

            bain added a comment -

            Just wanted to also point out that it would be nice if it were possible to find URLs using prefix queries. Check out JRA-17463.

            ɹǝʞɐq pɐɹq added a comment -

            @Greg Miller

            Our current workaround is to set the Lucene Indexer Language to Other in the General configuration section. This will not do any stemming. However, it has a cost: searches such as "cat" won't find "cats". Please be aware of this.

            Greg Miller added a comment -

            We are also having a problem with customers searching our Technical Support documents (using Confluence) if the search query contains an underscore. The issue is listed here:

            http://jira.atlassian.com/browse/CONF-14554

            We can't use wildcards when "_" is in the search term. We are getting a lot of customer complaints about this.

            Is there some type of workaround?

            -Greg

            G B added a comment -

            Tom, you appear to have hit the same problem that I just reported in JRA-14641. In the Lucene query format "CSS Report" and CSS_Report are indistinguishable. Further, both are considered search phrases and wildcards are not allowed in search phrases. Therefore it is impossible to do the search you want.

            I have started setting up saved searches where I programmatically generate a list of all possible matching search results and insert that ridiculously long list into the "Query" field in the search.

            e.g. Query: CSS_Report_v1.001 OR CSS_Report_v1.002 OR CSS_Report_v1.003 OR ... OR CSS_Report_v1.999

            I have successfully tested this up to 10,000 enumerations. At 100,000 I think the textbox in Firefox broke, so there's a limit in there somewhere.

            This workaround assumes that you can enumerate the list of things that you are looking for and
            that there are fewer than 10,000 of them. In most cases, we can't.
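
The enumeration workaround described above could be generated with a few lines of Python. This is only a sketch: the helper name `build_or_query` and its parameters are hypothetical, and it assumes the variants differ only by a zero-padded numeric suffix, as in the example query.

```python
def build_or_query(prefix: str, start: int, stop: int, width: int = 3) -> str:
    """Join every enumerated term with OR, e.g. prefix001 OR prefix002 OR ..."""
    terms = [f"{prefix}{n:0{width}d}" for n in range(start, stop + 1)]
    return " OR ".join(terms)


# Reproduces the shape of the example query: v1.001 through v1.999
query = build_or_query("CSS_Report_v1.", 1, 999)
print(query[:60] + " ...")
print(len(query), "characters to paste into the Query field")
```

The resulting string grows linearly with the number of terms, which is consistent with the browser text box eventually becoming the limiting factor.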

            Tom Clarkson added a comment - - edited

            This is a serious problem (Enterprise 3.12.1) - results missing, false positives... users can't find what they need to:

            I created a test project to investigate, and created 5 issues with variations on name:

            CSS_Report v1.234
            CSS_Reporting_v1.234
            CSS_Reports_v1.234
            CSS Report v1.234
            CSS-Report v1.234

            (note the variations in underscore, dash and space in the name)

            Search Term | Expected results | Jira's results | Correct?
            CSS* | All 5 | All 5 |
            "CSS" | not sure (I'd expect just "CSS Report v1.234") | All 5 |
            "CSS " | just "CSS Report v1.234" | All 5 |
            "CSS_" | Just the three with underscores | All 5 |
            CSS_* | Just the three with underscores | none |
            CSS_Report | one (CSS_Report v1.234) | CSS_Report v1.234; CSS Report v1.234 ?; CSS-Report v1.234 ? |
            CSS_Report* | Just the three with underscores | none |
            "CSS_Report*" | Just the three with underscores | CSS_Report v1.234; CSS Report v1.234 ?; CSS-Report v1.234 |

            I have not yet found a search which will return just the names which start with "CSS_Report" (i.e. the typical results expected from a search such as CSS_Report*).

            Can anyone shed any light on how to return these results?

            Neil Arrowsmith added a comment -

            Added my vote as one of our users just complained that they can't search for the word "stepped" without getting a load of results matching "step", "steps" etc., which for us is quite a big deal.

            I can try out the workaround, but presumably changing the index language will then remove the ability for people to do explicit fuzzy searches and return "stepping", "steps", "stepped" etc. with a search for "step~"?

            Which raises a question for me - what's the point of the fuzzy search mechanism if all searches are fuzzy anyway?

            John M. Black added a comment -

            Sorry, I posted my comments in the "duplicate" (9240) by mistake.
            ============

            This problem is more than just stemming, so as a matter of principle I don't want a workaround that only addresses the stemming/ignored words problem.

            The root of the problem is that JIRA/Lucene/whomever does not respect the user's attempt to enter an exact phrase. Any decent text search engine, IMHO, should:

            1) allow users to switch to an "exact phrase" mode, via some preference-toggle or widget
            --or, better yet,
            2) allow users to surround the search terms (or any part of them) in quotes; and when encountering the quotes, treat that portion as an "untouchable" character sequence without any splitting or stemming. (This is so basic! I think by now, most web power users will instinctively try quotes for phrase searching.)

            If we can do this, then all of the other symptoms in these discussions (and they are all symptoms of the same cause) will either disappear or have a highly-usable workaround. Don't want stemming? Don't want to drop words? Need to search an exact phrase? Just surround it in quotes.

            Mr Automation Guy added a comment -

            Thanks Neal. You guys are the best. This is a "CLOSED" issue for me.

            Neal Applebaum added a comment -

            According to the documentation:

            Note: All query terms in JIRA are case insensitive.

            Mr AutomationGuy added a comment -

            Neal, I apologize, IT WORKED; I might have done something wrong. However, searching for '[PRODUCTION]' returns 'Production' as well. Is there any way to search case-sensitively and/or to look for [PRODUCTION] only? I thought searching for "[PRODUCTION]" should only return [PRODUCTION].

            ~Omid

            Mr AutomationGuy added a comment -

            Hi Neal:

            To eliminate doubts I'm going to start over, do the entire process, take a snapshot and place it here.

            Here we go.............

            ~Omid

            Neal Applebaum added a comment -

            I followed Atlassian's instructions, and it worked for me (standalone) just fine. The stemming problem went away when Indexing was set to Other. Omid - your comment was a little cryptic (e.g. was the typo unintentional, how many rows were returned when searching for Production in English vs. in Other setting)?

            In my test, I did a search for issues with "product" in issue summary, and it found 4 issues, including hits on "product", "products", "production". When I re-indexed with "Other" as indexing language, the search found only 2 - the 2 with exactly "product". Only when I searched on "product*" did it find all 4.

            Are you sure the search didn't find some issues because the search included more fields (e.g. description, comments) where the word was also found?

            Mr AutomationGuy added a comment -

            Correct, Nick:

            These are the steps I performed:

            Steps to reproduce:
            1) Changed the "Indexing Language" in the General configuration from "English" to "Other"
            2) Re-index
            3) Restart the service
            4) Search for "[PRODUCTION]"

            Result: returned query:
            [PRODUCITON] ....
            Major Product

            ~Omid

            Nick Menere [Atlassian] (Inactive) added a comment -

            Hi guys,
            If you set the indexing language to "Other", Lucene will no longer stem words and you will have exact results after a reindex.

            Omid,
            Are you saying this is not the case?

            Neal Applebaum added a comment -

            I think Atlassian needs to respond to us about this issue. It doesn't matter if the searching is fast if it is inaccurate, IMHO. According to Atlassian, the stemming problem should go away when indexing is not set to English.

            Mr Automation Guy added a comment -

            Neal:

            I did in fact re-index, and to make sure I bounced the server too; the result is the same. I switched it back to "English" now. Also, I only have "English" as my choices. Do I need to install "Lucene Indexer Language", or is "English" fine?

            David:

            After following the instructions above we are still not getting the kind of search we are hoping to get. Searching for "[PRODUCTION]" still returns "Major Product" as one of the results. It seems strange to me that it finds "Product" when I look for PRODUCTION, hmm.

            ~Omid

            David Zawalski added a comment -

            Hello folks,

            The organization I work for is also experiencing great pain with this searching issue. I read through this thread of messages but could not tell if switching to "other" and re-indexing provided relief from the text searching issue. I would appreciate it if someone could comment on whether or not this works.

            Thanks in advance,
            Dave

            Neal Applebaum added a comment -

            Silly question, but ... after changing the indexing language from English to Other ... did you re-index?

            Mr AutomationGuy added a comment -

            This is still NOT WORKING for me.

            The Search Indexing language in my version is set to English, so when I changed it to "Other" I couldn't even find simple text. I couldn't even find any issue when I filtered.

            I switched it back to English and I still have the old issue: I searched for [PRODUCTION] and found "Product", "This is a production..".

            ~Omid

            Melissa added a comment -

            My environment also has this issue. In fact, just today our sr QA guy told me he had lost confidence in JIRA's search results, and that without being able to accurately find the issues he needs the rest of the functionality was meaningless. I figured others must be having some problems too, and was able to find this thread - although it was fairly time consuming and I had to hunt through a lot of other issues. Before today, I had no idea about Lucene and the stemming. It seems like it would be preferable to have installations default to 'other' so that stemming is off. Stemming is counterintuitive to folks who are used to doing text searches, and while I can see it might be beneficial for broadening searches, wildcards handle that more effectively and accurately anyway. Accurate and precise results are much more important than broad results.

            Mr AutomationGuy added a comment -

            I 2nd that. I am still waiting for some sort of solution. lol

            Bernard Durfee added a comment -

            Any ETA on this? It was created 28/Dec/04... too bad bugs don't die of old age!

            AntonA added a comment -

            Ted,

            You are right, you will need to reindex.

            If stemming is "disabled" (i.e. not done) then it will only ever find exact matches, unless you use wildcards, etc. In the case of wildcards I believe the ranking will be done.

            Please let us know how you go.

            Thanks,
            Anton

            Ted Pietrzak added a comment -

            We're going to set the indexing language to "Other" this weekend during our next maintenance window. I presume that we will have to reindex afterwards to realize the benefit of this.

            I'll have a look at the dropped word business afterwards as well. I looked through a Lucene implementation and my recollection (admittedly foggy) was that the stop words were handled outside the stemming algorithm but I am prepared to be pleasantly surprised!

            With respect to returning higher ranked results first, if we turn off stemming, it should be possible to do that, right?

            Ted

            AntonA added a comment -

            Hi Ted,

            By dropped words I assume you are referring to stop words mentioned on this page:
            http://confluence.atlassian.com/display/JIRA/Words+ignored+when+searching

            I believe if the indexing language is set to Other, all of the words will be indexed and no stemming will be performed. Have you tried using this configuration? Does it make searching meet your requirements?

            We really appreciate your feedback, and we will definitely take it into consideration. As Nick mentioned, removing Lucene would be quite costly at the moment. But maybe we can improve it in the future. One way I can see is to make exact matches score higher. Lucene does rank results by relevance. The problem is that once the words are stemmed, Lucene does not know the difference between 'customer' and 'custom' and hence cannot rank them properly. Unfortunately, at the moment I cannot make any promises about the implementation date for these improvements.

            Please let us know if setting the indexing language to Other helps.

            Thanks,
            Anton
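
            To illustrate the point above about stemmed terms being indistinguishable, here is a minimal sketch of an index that stores only stemmed terms. Everything in it is an assumption made for illustration: `toy_stem` is a hypothetical suffix stripper (it is not the Porter stemmer and not Lucene's analyzer), and the issue keys and summaries are made up.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical suffix stripper for illustration only (not Porter, not Lucene)."""
    word = word.lower()
    for suffix in ("ers", "er", "s"):
        # Only strip when at least four characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stemmed term to the set of issue keys containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index, query):
    """Look up the stemmed query term; the original word form is not stored."""
    return sorted(index.get(toy_stem(query), set()))


# Made-up sample data.
docs = {
    "DEMO-1": "customer cannot log in",
    "DEMO-2": "custom field missing",
}
index = build_index(docs)

print(search(index, "customer"))
print(search(index, "custom"))
```

            Both queries return the same two issues, because the index only ever sees the shared stem and has nothing left to rank an exact match on.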


            Ted Pietrzak added a comment -

            Nick,
            I appreciate the thought on stemming, but it is only the tip of the iceberg with respect to Lucene. Then we have the issue of words that are dropped. And enforced case insensitivity and fragility of the indices.

            If Lucene is too deeply embedded in what you're doing, so be it. Give us a way to bypass it and use a more precise search mechanism. If I could just get to the varchar fields directly, that would be a huge improvement. So many of our problems could be addressed if there were a mechanism I could use to access the database engine's built-in capabilities rather than being lumbered with Lucene.

            Sorry to keep beating this horse, but this is a daily source of pain. I get a call or an e-mail at least once a day with some new, bizarre search result. The good news is that I've been rather forcibly educated on how Lucene works. The bad news is that the answer is all too often "Sorry, can't help you with that."

            Ted

            Nick Menere [Atlassian] (Inactive) added a comment -

            Ted,

            I don't think dropping Lucene is practical for us, as it is fundamental to a lot of the things we do within Jira.

            To get the results you are after, change your indexing language to Other.
            Other does not do any stemming at all (though I believe it still gets rid of special characters).

            Cheers,
            Nick

            Neal Applebaum added a comment - Amen!

            Ted Pietrzak added a comment -

            All the more reason to ditch Lucene! The results of queries are all too often seemingly inexplicable to the end user community. If you have to spend 10 minutes explaining why they got some goofy result and then tell them that there is no way to get the correct result, then something is wrong.

            Lucene-style fuzzy matches may be jim dandy for web search engines, but that's not what we're dealing with in a defect tracking system. People want precise search results. They generally have a pretty good idea of what they're looking for. If the correct results are buried under hundreds of incorrect results, that's not useful. Lucene doesn't even offer the mitigation of "weighing" the results of a search and putting the "heaviest" results first in the list.

            So, it's clear that the real benefit of Lucene is that it offers a database-engine-agnostic way of searching textual fields. I would cheerfully trade universal database support for a product that supports 2-3 top-notch databases and gives the right answer when I do searches.

            Nick Menere [Atlassian] (Inactive) added a comment - At the moment the only way to do this is to change the Lucene Indexer Language to Other in the General Configuration section. This will not do any stemming. Stemming is a very tricky thing and I don't think anyone will ever create a perfect English stemmer.

            Neal Applebaum added a comment -

            See http://jira.atlassian.com/browse/JRA-6187#action_52146

            Scott, how it should work is if I search on "custom" it finds only "custom". If I search on "custom*" it will find customer as well. It's not just annoying, it's wrong.
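
            The behaviour requested here (a plain term matches only itself, a trailing `*` matches any word with that prefix) can be sketched as follows. This is only an illustration under stated assumptions: the function names and sample issues are hypothetical, matching is assumed to be case-insensitive, and it does not use Lucene or reflect how JIRA parses queries.

```python
def matches(query, token):
    """Exact match by default; a trailing '*' turns the query into a prefix match."""
    query, token = query.lower(), token.lower()
    if query.endswith("*"):
        return token.startswith(query[:-1])
    return token == query


def search_unstemmed(docs, query):
    """Return the keys of issues with at least one token matching the query."""
    return sorted(
        doc_id
        for doc_id, text in docs.items()
        if any(matches(query, token) for token in text.split())
    )


# Made-up sample data.
docs = {
    "DEMO-1": "customer cannot log in",
    "DEMO-2": "custom field missing",
}

print(search_unstemmed(docs, "custom"))   # only the issue containing "custom"
print(search_unstemmed(docs, "custom*"))  # both issues
```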

            Mr Automation Guy added a comment - We also have this issue. We have 1300 issues containing the word [PRODUCTION], and when we search for it we also get all the ones that contain "PRODUCT". It is very annoying.

            Scott Farquhar added a comment - How would you want this to work? I think that it should match both, although perhaps 'customer' matches should be higher?
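
            One way to picture "match both, but rank exact matches higher" is to score each issue on both the stemmed and the exact form of the query. The sketch below is an assumption-laden illustration, not Lucene's scoring: `toy_stem` is a hypothetical suffix stripper, `exact_boost` is an invented parameter, and the sample issues are made up.

```python
def toy_stem(word):
    """Hypothetical suffix stripper for illustration only (not Porter, not Lucene)."""
    word = word.lower()
    for suffix in ("ers", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def score(query, text, exact_boost=2.0):
    """Stem-only matches count 1.0 each; exact matches count exact_boost each."""
    query = query.lower()
    query_stem = toy_stem(query)
    total = 0.0
    for token in text.lower().split():
        if token == query:
            total += exact_boost
        elif toy_stem(token) == query_stem:
            total += 1.0
    return total


def ranked_search(docs, query):
    """Return matching issue keys, highest score first (ties broken by key)."""
    hits = [(score(query, text), doc_id) for doc_id, text in docs.items()]
    hits = [(s, doc_id) for s, doc_id in hits if s > 0]
    hits.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in hits]


# Made-up sample data.
docs = {
    "DEMO-1": "custom field missing",
    "DEMO-2": "customer cannot log in",
}

print(ranked_search(docs, "customer"))  # DEMO-2 first, then DEMO-1
print(ranked_search(docs, "custom"))    # DEMO-1 first, then DEMO-2
```

            The design choice is that this needs the unstemmed token to be available at scoring time, which a stem-only index does not keep.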

              ohernandez@atlassian.com Oswaldo Hernandez (Inactive)
              7ee5c68a815f Jeff Turner