
Key: CONF-9258
Type: Improvement
Status: Resolved
Resolution: Fixed
Priority: Critical
Assignee: Agnes Ro [Atlassian]
Reporter: Neeraj Jhanji [Atlassian]
Votes: 0
Watchers: 1
Confluence

Incorrect search results for single and double-byte Japanese strings

Created: 22/Aug/07 05:40 AM   Updated: 16/Apr/08 04:32 AM
Component/s: Internationalisation
Affects Version/s: 2.4.3
Fix Version/s: 2.6.2

Time Tracking:
Not Specified

File Attachments: 1. Character requirement (double).xls (17 kB)
2. Character requirement (single).xls (17 kB)
3. Character requirement (single).xls (17 kB)
4. Character+requirement+(single).xls (17 kB)
5. Search_Test_Results.xls (75 kB)

Image Attachments: 1. Screenshot-Site Search - Confluence - Mozilla Firefox.png (147 kB)
2. systeminfo.png (80 kB)
Environment: All with internationalization settings set correctly and using CJK indexing
Issue Links:
Part
 
Reference

Participants: Agnes Ro [Atlassian], Andrew Lynch [Atlassian], Matt Ryall [Atlassian], Neeraj Jhanji [Atlassian], Paul Curren [Atlassian] and Per Fragemann [Atlassian]
Since last comment: 47 weeks, 4 days ago
Resolution Date: 11/Nov/07 09:24 PM
Labels:


Description
Hello,

We noticed incorrect behavior in how Confluence searches for single-byte and double-byte Japanese strings. A search for Roman letters returns both single-byte and double-byte matches, as it should. The same behavior is required for Japanese characters: for example, a search for single-byte katakana or numeric characters should return both single-byte and double-byte occurrences, but currently only double-byte matches are retrieved. Please see the attached Excel sheets, which summarize the current behavior and the required behavior for single-byte and double-byte Japanese strings. Could you please investigate this and incorporate this improvement in a future release?

Thanks,

Neeraj



Neeraj Jhanji [Atlassian] added a comment - 19/Sep/07 06:25 PM
Could you decide the fix release for this bug?

Matt Ryall [Atlassian] added a comment - 20/Sep/07 12:01 AM
I've flagged this issue with our product management team. We'll update it as soon as it is scheduled for a release.

Neeraj Jhanji [Atlassian] added a comment - 04/Oct/07 06:42 AM
Matt,

This is a critical issue for Japanese users - I would say the highest priority issue among open issues related to Japanese internationalization. Can we target it for a fix in 2.6.1 if possible?

Thanks,

Neeraj


Matt Ryall [Atlassian] added a comment - 04/Oct/07 11:13 PM
Hi Neeraj,

This is classed as an improvement, so it doesn't qualify for fixing in a stable release like 2.6.1. I believe the problem with searching for single-byte characters is related to Lucene's internal encoding, so it's not a trivial fix either.

I'll make sure your comment is flagged with our product manager.

Regards,
Matt


Per Fragemann [Atlassian] added a comment - 05/Oct/07 02:58 AM
Hi, although I do understand this is a big problem, it is certainly not a blocker in the sense that the application crashes, lots of data is lost, and so on, so I am reducing the priority to "Critical". Please see our guidelines (http://confluence.atlassian.com/display/Support/JIRA+usage+guidelines) for more information on priorities.

Neeraj Jhanji [Atlassian] added a comment - 05/Oct/07 03:09 AM
Hi Per,

I appreciate your guidance. I rated this as a blocker from a sales and market development perspective, since we cannot be the best enterprise wiki in our class if our search simply does not find matching documents. For Japanese customers, this is the same as search not working at all. I am fine as long as this gets addressed within a reasonable time-frame, because we are getting a lot of heat from our customers on this issue.

cheers, Neeraj


Neeraj Jhanji [Atlassian] added a comment - 16/Oct/07 03:49 AM - edited
Attached a file showing Japanese text search results for various single-byte and double-byte Japanese search strings. The ones circled are OK; the ones crossed are No-Go.

Andrew Lynch [Atlassian] added a comment - 21/Oct/07 09:09 PM
Hi Neeraj,

Thanks for providing the example documents. I am afraid I am not sure I understand the desired behavior completely.
Is the following example in line with your expectations?

Steps:

1) Create a page containing the full width character カ.
2) Perform a search for カ.
3) 1 result should be returned. (This appears to work correctly)
4) Search for the string ｶ (half-width character).
5) No results are returned (but you require / expect the same result obtained in step 3).
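
The two forms of "ka" in these steps are distinct Unicode code points, which is why an exact match on one does not find the other. The following is only a minimal illustration of that fact; the class and method names are hypothetical and have nothing to do with Confluence's or Lucene's code.

```java
final class KanaCodePoints {
    // Written as escapes so the source stays ASCII-only.
    static final String FULL_WIDTH_KA = "\u30AB"; // full-width katakana KA
    static final String HALF_WIDTH_KA = "\uFF76"; // half-width katakana KA

    /** Formats the first code point of s in U+XXXX notation. */
    static String describe(String s) {
        return String.format("U+%04X", s.codePointAt(0));
    }

    public static void main(String[] args) {
        System.out.println(describe(FULL_WIDTH_KA));
        System.out.println(describe(HALF_WIDTH_KA));
        // A plain string comparison treats the two widths as unrelated text.
        System.out.println(FULL_WIDTH_KA.equals(HALF_WIDTH_KA));
    }
}
```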

Please advise me if this is the behavior you are expecting. Unfortunately I have little experience with the Japanese language, so I may be misinterpreting your request.

Thanks,
Andrew


Neeraj Jhanji [Atlassian] added a comment - 21/Oct/07 09:49 PM
Hi Andrew,

Thanks for looking into this. Let me explain as below:

PART 1 (Critical, applies to half-width katakana characters):
1) Create a page containing the half-width character ｶ.
2) Perform a search for ｶ.
3) 1 result should be returned. (zero results are being returned erroneously)

PART 2 (High, applies to both half-width and full-width katakana characters):
4) Now create a new page containing the full width character カ
5) Perform a search for カ.
6) 2 results should be returned. (currently only 1 result (full-width) is returned)
7) Search for the string ｶ (half-width character).
8) 2 results should be returned. (currently zero results are returned)

PART 3 (High, applies to both half-width and full-width numeric characters):
9) Now create a page containing the full-width character １ and another page containing the half-width character 1.
10) Perform a search for the full-width １.
11) 2 results should be returned. (currently only 1 result (full-width) is returned)
12) Search for the half-width 1.
13) 2 results should be returned. (currently only 1 result (full-width) is returned, as in step 11)
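
The expected results in Parts 1-3 can be written down as a small executable check. This is a sketch under stated assumptions, not Confluence's search: pages are plain strings, matching is a substring test, and Unicode NFKC normalization (`java.text.Normalizer`, available from Java 6) stands in for whatever width folding the indexer would apply. The class and method names are hypothetical.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.List;

final class WidthInsensitiveSearchCheck {
    // Page bodies from Parts 1-3, written as escapes so the source stays ASCII-only.
    static final String HALF_KA = "\uFF76";  // half-width katakana KA
    static final String FULL_KA = "\u30AB";  // full-width katakana KA
    static final String FULL_ONE = "\uFF11"; // full-width digit one
    static final String HALF_ONE = "1";      // ordinary (half-width) digit one

    /** Assumed folding step: NFKC maps width variants to a single form. */
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    /** Counts the pages whose folded body contains the folded query. */
    static int countMatches(List<String> pages, String query) {
        String foldedQuery = fold(query);
        int hits = 0;
        for (String page : pages) {
            if (fold(page).contains(foldedQuery)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList(HALF_KA, FULL_KA, FULL_ONE, HALF_ONE);
        // Each query should find both width variants of its own character.
        for (String query : pages) {
            System.out.println(countMatches(pages, query));
        }
    }
}
```

With all four pages present, every query in the loop corresponds to a "2 results should be returned" step; Part 1 is the same check against a list holding only the half-width page.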

Please let me know if you have further questions.

regards,
Neeraj


Andrew Lynch [Atlassian] added a comment - 21/Oct/07 11:17 PM
Thanks Neeraj,

I am unable to produce the behavior you have indicated for part 1. I have attached a screen shot of correct results being returned for a page containing a half width katakana character "ka".
That being said, I have discovered this result will only be returned if the top search bar is used (i.e. not the search in the search tab), so we may well have a bug which is leading to the results you have experienced.
Which search bar are you using to perform the search?
In addition, could you provide us with the system information from the administration tab of the machine you are encountering this problem on?

For part 3, when entering characters in both full width and half width, I was unable to retrieve any results. How exactly were you entering these characters?

Thanks,
Andrew


Neeraj Jhanji [Atlassian] added a comment - 22/Oct/07 06:14 PM
Hi Andrew,

Attached is the system information of the machine the client tested this problem on. I have not verified this for all cases, and assume the client was searching via the search tab.

Regarding the half-width and full-width numeric characters, you may well be right.

I guess you are now at least as aware of the problem as I am. We basically need to ensure that, no matter which search box is used, Confluence is able to search for half-width and full-width katakana and numeric characters correctly, i.e. it should pass the tests for parts 1-3.

Let me know how I can help further.

Neeraj


Andrew Lynch [Atlassian] added a comment - 24/Oct/07 06:59 PM
Hi Neeraj,

I realized that I was using the English indexer when performing my searches. After switching to CJK indexing, I was able to produce results consistent with your findings. I will investigate further.


Andrew Lynch [Atlassian] added a comment - 25/Oct/07 08:07 PM - edited
A quick update Neeraj,

Lucene's CJKAnalyzer is definitely not indexing half width characters correctly. I have raised an issue (http://issues.apache.org/jira/browse/LUCENE-1032) to address this.
We were considering creating a patch ourselves, but the simplest implementation would require the use of Java 6's Normalizer class. To solve this, I have created our own Analyzer, the Custom Japanese Analyzer. Unfortunately it only works on Sun JDKs, so it will not be incorporated into Lucene and may not work for all customers.

Customers who are experiencing problems such as the ones you outlined should use this Analyzer in place of CJKAnalyzer until the issue with Lucene is resolved, assuming they have a Sun JDK.
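
For readers unfamiliar with the Normalizer approach mentioned above, here is a minimal sketch of width folding with `java.text.Normalizer` (the public class added in Java 6). It is not the Custom Japanese Analyzer's code, which is not shown in this issue; the class and method names are hypothetical. It only shows that NFKC normalization maps half-width katakana and full-width digits to their standard forms, including a half-width kana followed by a half-width voiced sound mark, which a simple one-to-one character table would miss.

```java
import java.text.Normalizer;

final class HalfWidthFoldingSketch {
    /** Applies NFKC so that width variants of a character share one form. */
    static String foldWidth(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String halfWidthKa = "\uFF76";       // half-width katakana KA
        String halfWidthGa = "\uFF76\uFF9E"; // half-width KA + half-width voiced sound mark
        String fullWidthOne = "\uFF11";      // full-width digit one

        // Half-width KA folds to full-width KA (U+30AB).
        System.out.println(foldWidth(halfWidthKa).equals("\u30AB"));
        // The two-character half-width sequence folds to the single letter GA (U+30AC).
        System.out.println(foldWidth(halfWidthGa).equals("\u30AC"));
        // Full-width digit one folds to ASCII "1".
        System.out.println(foldWidth(fullWidthOne).equals("1"));
    }
}
```

Applying the same folding to both the indexed text and the query string is what makes the two widths match; folding only one side would leave the mismatch in place.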


Andrew Lynch [Atlassian] added a comment - 25/Oct/07 08:08 PM
Fixed by use of custom Analyzer.

Neeraj Jhanji [Atlassian] added a comment - 25/Oct/07 08:41 PM
Hi Andrew,

1. Where can I get the Custom Japanese Analyzer and what are the installation instructions for it?

2. Regarding the Lucene support issue you mention above, I did not see any mention of the problems with half-width Japanese characters.

Please clarify.

regards,

Neeraj


Andrew Lynch [Atlassian] added a comment - 25/Oct/07 08:44 PM
Hi Neeraj,

My apologies, I provided the wrong issue number. It should be LUCENE-1032.

The custom Japanese analyzer will be available as an option in the Content Indexing tab from 2.6.2 onwards.

Regards,

Andrew Lynch


Neeraj Jhanji [Atlassian] added a comment - 25/Oct/07 09:43 PM - edited
We are focused on releasing Confluence 2.6 to Japanese customers. This is a major upgrade from the previous JP release 2.4.3 since it fixes a major issue surrounding PDF export. If possible, we'd like to include the search fix in this release as well since the next Japanese release will be further out (possibly next year). Is it possible to get a patch for 2.6?

Also, to confirm, will search for half width katakana and numeric characters work properly from the top search bar as well as from the search tab?


Andrew Lynch [Atlassian] added a comment - 28/Oct/07 07:55 PM
Hi Neeraj,

The changes were not too numerous, so I should be able to provide a patch for 2.6.0 if essential.

The search tab has not yet been fixed; I have created another issue (CONF-9834), which should be watched for updates.


Neeraj Jhanji [Atlassian] added a comment - 29/Oct/07 01:10 AM
Yes, it would be great if you could provide a patch for 2.6.0 that addresses CONF-9258, CONF-9833 & CONF-9834, which are all major issues from a Japanese QA perspective.

Paul Curren [Atlassian] added a comment - 08/Nov/07 05:57 PM
Agnes, could you please review these changes prior to 2.6.2?

Thanks.