
Key: CONF-5142
Type: Bug
Status: Open
Priority: Major
Assignee: Unassigned
Reporter: Jeremy Higgs
Votes: 9
Watchers: 8
Confluence

Stemming and wildcards do not play nicely in search queries

Created: 11/Jan/06 12:30 AM   Updated: 28/Jan/10 08:58 PM
Component/s: Searching / Indexing
Affects Version/s: 2.0, 2.1.1
Fix Version/s: None

Time Tracking:
Not Specified

Issue Links:
Reference
 

Participants: Anatoli Kazatchkov [Atlassian], Han Chen, Jeremy Higgs, Martin Barkanowitz, Nicole Redaschi, Piotr Zoladz and Tony Atkins [Atlassian]
Since last comment: 1 week, 4 days ago
Internal Complexity: 6
Internal Value: 3
Labels:


Description

There is a slight issue with searching in that if you search for a part of a word and apply a wildcard, Lucene doesn't find the word you intended.

e.g. if you search for "Management" (no quotes) on CAC, it returns a bunch of results. A search for "Managemen*", however, only returns one.

The reason for this is that "Managemen" is not a real English word, and so is not stemmed. The query term therefore does not match "manag", the stemmed form of "management" that we have in the index, and the correct results aren't returned. (Note: the attachment returned by the wildcard query is due to the indexing of the full filename, which then matches "managemen*")

A solution to this may be to store the original word (as well as the stemmed one) in a different field in the index. When a wildcard search term comes through, search both the original and the stemmed words. The cache may be bigger, and there may be a slight performance hit, but it will make searching a bit more reliable in these edge cases.
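
To make the proposal concrete, here is a minimal sketch in Python rather than against the real Lucene API. The `toy_stem` suffix stripper, the index layout, and the `search` helper are all hypothetical stand-ins chosen for illustration; the sketch only assumes that wildcard terms are matched as patterns against indexed terms without being stemmed.

```python
from fnmatch import fnmatchcase

# Hypothetical toy stemmer: strips a few suffixes. It is NOT the stemmer
# Confluence uses; it only reproduces "management" -> "manag".
SUFFIXES = ("ement", "ing", "es", "s")

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def build_index(docs):
    """Index every token twice: once stemmed, once in its original form."""
    index = {"stemmed": {}, "exact": {}}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index["exact"].setdefault(token, set()).add(doc_id)
            index["stemmed"].setdefault(toy_stem(token), set()).add(doc_id)
    return index

def search(index, term, fields):
    """Return ids of documents matching `term` in any of `fields`."""
    term = term.lower()
    is_wildcard = "*" in term or "?" in term
    hits = set()
    for field in fields:
        postings = index[field]
        if is_wildcard:
            # Assumption: wildcard terms are not stemmed, only pattern-matched.
            for indexed_term, ids in postings.items():
                if fnmatchcase(indexed_term, term):
                    hits |= ids
        else:
            key = toy_stem(term) if field == "stemmed" else term
            hits |= postings.get(key, set())
    return hits

docs = {
    1: "risk management plan",
    2: "management of releases",
    3: "release notes",
}
index = build_index(docs)

# Stemmed field only: the plain term matches, the wildcard term does not.
plain = search(index, "management", ["stemmed"])
broken = search(index, "managemen*", ["stemmed"])
# Searching the exact field as well restores the expected wildcard results.
fixed = search(index, "managemen*", ["stemmed", "exact"])
```

The second field roughly doubles the number of indexed terms in this sketch, which corresponds to the size and performance cost mentioned above.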



Nicole Redaschi added a comment - 25/Jan/06 04:42 AM

This behaviour has really confused us, especially because the stemming does not always seem to work as expected. Here are some example results from our Confluence server:

search for alias etc.:
7 for alias
2 for aliases
2 for alias*

search for evidence etc.:
47 for evidence
47 for evidences
3 for evidence*
1 for evidenc
4 for evidenc*

Because our working language is English, but this is not the mother tongue of most of our users, they often use * because they are unsure about the correct English spelling. It would be great if you could implement the solution you suggest above.


Martin Barkanowitz added a comment - 01/Jul/09 08:45 AM

A number of users in my company have reported the same problem, and search is an essential function for us.

I really can't believe that Atlassian has not even assigned this important issue to someone in over 3 years!

This is still not fixed in Confluence 2.10.3. Sorry, but this is not a good job.
Please fix it!

Regards,
Martin


Tony Atkins [Atlassian] added a comment - 06/Jul/09 04:19 AM

I can still reproduce this on CAC.

Here's another variation: try searching for "management -manage". I would expect to get results for management but not manage. Because both the positive and negative terms are stemmed, I get zero results.
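
The zero-result behaviour can be reproduced with a small Python sketch. Everything here is hypothetical: the `TOY_STEMS` table simply maps the words involved to the stem "manag" quoted in the description, and the `stem_negatives` switch is an illustrative option, not an existing Confluence or Lucene setting.

```python
# Hypothetical stem table, limited to the words used in this example.
TOY_STEMS = {"management": "manag", "manage": "manag", "managing": "manag"}

def stem(word):
    word = word.lower()
    return TOY_STEMS.get(word, word)

def search(docs, query, stem_negatives=True):
    """Match all plain terms by stem; drop documents hit by any '-term'."""
    required, excluded = [], []
    for part in query.split():
        if part.startswith("-"):
            excluded.append(part[1:])
        else:
            required.append(part)

    hits = set()
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        stems = {stem(t) for t in tokens}
        if not all(stem(t) in stems for t in required):
            continue
        if stem_negatives:
            # Behaviour described above: the negative term is stemmed too.
            blocked = any(stem(t) in stems for t in excluded)
        else:
            # Illustrative alternative: exclude only the literal word.
            blocked = any(t.lower() in tokens for t in excluded)
        if not blocked:
            hits.add(doc_id)
    return hits

docs = {
    1: "change management process",
    2: "how to manage spaces",
    3: "management tools to manage risk",
}

stemmed_negation = search(docs, "management -manage")
literal_negation = search(docs, "management -manage", stem_negatives=False)
```

With stemmed negation, every document containing "management" also contains the excluded stem, so nothing survives. Matching the excluded word literally leaves only the document that mentions "management" without "manage".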


Tony Atkins [Atlassian] added a comment - 06/Jul/09 04:47 AM

In my testing, the word "commit" is a great example:

  1. "commit" (no wildcard) matches the literal word "commit" as well as the variations.
  2. The search term "commits" (no wildcard) returns all variations on "commit" and the same number of results.
  3. "commit*" matches the literal word "commit" as well as the variations.
  4. The search term "commits*" returns a single match for the literal "commits".

So: stemming works, reverse stemming works, stemming with a wildcard works, reverse stemming with a wildcard doesn't.


Piotr Zoladz added a comment - 19/Aug/09 09:00 AM

I have a similar problem.

Go to Administration | Global settings | edit configuration
Try changing the Jira indexing language from "english" to "other", and repeat your tests.

Here is a description with screenshots: http://www.atlassian.com/software/jira/docs/v3.13/configure.html

I've looked into it for a while, and it is probably connected with the Lucene search mechanism, which applies some special processing when searching in English. That processing is not necessarily helpful when you need exact results with wildcard characters.


Anatoli Kazatchkov [Atlassian] added a comment - 01/Sep/09 01:26 AM

Piotr,

Although your comment also applies to Confluence in this case, you are probably talking about Jira. Confluence has the same setting for the indexing language.

Anatoli.


Han Chen added a comment - 28/Jan/10 08:58 PM

I've confirmed another issue with wildcards.

If you use a wildcard in your search, the search becomes case-sensitive, i.e.
the results for "C?nfluence" and "c?nfluence" are not the same.
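
One plausible explanation, which is an assumption and not confirmed anywhere in this issue, is that indexed terms are lowercased while wildcard terms skip that normalisation step. The Python sketch below models only that assumption; `analyze`, `search`, and the `normalize_wildcards` flag are hypothetical names, not part of any real API.

```python
from fnmatch import fnmatchcase

def analyze(text):
    """Toy analysis step: split on whitespace and lowercase each token."""
    return [token.lower() for token in text.split()]

def search(indexed_terms, term, normalize_wildcards=False):
    """Return the indexed terms matched by `term`."""
    is_wildcard = "*" in term or "?" in term
    if not is_wildcard:
        # Ordinary terms go through the same lowercasing as indexed text.
        return {t for t in indexed_terms if t == term.lower()}
    # Assumed behaviour: wildcard terms are used as typed unless normalised.
    pattern = term.lower() if normalize_wildcards else term
    return {t for t in indexed_terms if fnmatchcase(t, pattern)}

indexed = set(analyze("Confluence and Conference"))

upper = search(indexed, "C?nfluence")
lower = search(indexed, "c?nfluence")
normalized = search(indexed, "C?nfluence", normalize_wildcards=True)
```

Under this assumption the uppercase pattern finds nothing because the index only holds lowercase terms, and lowercasing the pattern before matching makes both spellings behave the same.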

Han