
Incorrect search results for single and double-byte Japanese strings

    • Suggestion
    • Resolution: Fixed
    • 2.6.2
    • None
    • All with internationalization settings set correctly and using CJK indexing

      Hello,

      We noticed incorrect behavior in how Confluence searches for single-byte and double-byte Japanese strings. A search for Roman letters returns results that include both single-byte and double-byte matches, as it should. The same behavior is required when searching for Japanese characters: for example, a search for single-byte katakana or numeric characters should return results matching both single-byte and double-byte occurrences, but currently only double-byte matches are retrieved. Please see the attached Excel sheets summarizing the current behavior and the required behavior for single-byte and double-byte Japanese strings. Could you please investigate this and incorporate this improvement in a future release?

      Thanks,

      Neeraj

            [CONFSERVER-9258] Incorrect search results for single and double-byte Japanese strings

            systeminfo.png was removed at the customer's request.

            Sean Osawa (Inactive) added a comment

            Removed a test result file, which had been attached to this issue, as requested by the customer.

            Sean Osawa (Inactive) added a comment

            Agnes, could you please review these changes prior to 2.6.2?

            Thanks.

            Paul Curren added a comment

            Yes, it would be great if you could provide a patch for 2.6.0 that addresses CONF-9258, CONF-9833 & CONF-9834, which are all major issues from a Japanese QA perspective.

            Neeraj Jhanji added a comment

            Hi Neeraj,

            The changes were not too numerous, so I should be able to provide a patch for 2.6.0 if essential.

            The search tab has not yet been fixed; I have created another issue (CONF-9834), which should be watched for updates.

            Andrew Lynch (Inactive) added a comment

            Neeraj Jhanji added a comment - - edited

            We are focused on releasing Confluence 2.6 to Japanese customers. This is a major upgrade from the previous JP release 2.4.3 since it fixes a major issue surrounding PDF export. If possible, we'd like to include the search fix in this release as well since the next Japanese release will be further out (possibly next year). Is it possible to get a patch for 2.6?

            Also, to confirm, will searches for half-width katakana and numeric characters work properly from the top search bar as well as from the search tab?


            Hi Neeraj,

            My apologies, I provided the wrong issue number. It should be LUCENE-1032.

            The custom Japanese analyzer will be available as an option in the Content Indexing tab from 2.6.2 onwards.

            Regards,

            Andrew Lynch

            Andrew Lynch (Inactive) added a comment

            Hi Andrew,

            1. Where can I get the Custom Japanese Analyzer and what are the installation instructions for it?

            2. Regarding the Lucene support issue you mention above, I did not see any mention of the problems with half-width Japanese characters.

            Please clarify.

            regards,

            Neeraj

            Neeraj Jhanji added a comment

            Fixed by use of custom Analyzer.

            Andrew Lynch (Inactive) added a comment

            Andrew Lynch (Inactive) added a comment - - edited

            A quick update Neeraj,

            Lucene's CJKAnalyzer is definitely not indexing half-width characters correctly. I have raised an issue (http://issues.apache.org/jira/browse/LUCENE-1032) to address this.
            We were considering creating a patch ourselves, but the simplest implementation would require usage of Java 6's Normalizer class. In order to solve this, I have created our own analyzer, the Custom Japanese Analyzer. Unfortunately this only works on Sun JDKs, so it will not be incorporated into Lucene and may not work for all customers.

            Customers who are experiencing problems such as the ones you outlined should use this Analyzer in place of CJKAnalyzer until the issue with Lucene is resolved, assuming they have a Sun JDK.
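
For readers unfamiliar with width normalization, the following is a minimal sketch of the general idea, not the code of the Custom Japanese Analyzer itself. It assumes the "Normalizer class" mentioned above is `java.text.Normalizer` (available from Java 6) and that Unicode NFKC normalization is an acceptable way to fold half-width and full-width forms together. The class name `WidthNormalizer` and its two methods are hypothetical names used only for illustration.

```java
import java.text.Normalizer;

// Hypothetical helper, for illustration only; not part of Confluence or Lucene.
final class WidthNormalizer {

    // NFKC maps compatibility variants to their canonical counterparts:
    // half-width katakana become full-width katakana, and full-width
    // digits and Latin letters become their ASCII equivalents.
    static String normalizeWidth(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    // Width-insensitive substring match: normalize both sides, then compare.
    static boolean matchesIgnoringWidth(String content, String query) {
        return normalizeWidth(content).contains(normalizeWidth(query));
    }

    public static void main(String[] args) {
        // Half-width katakana query against full-width content.
        System.out.println(matchesIgnoringWidth("\u30AB\u30BF\u30AB\u30CA", "\uFF76\uFF80\uFF76\uFF85"));
        // Full-width digits query against ASCII digits.
        System.out.println(matchesIgnoringWidth("Version 123", "\uFF11\uFF12\uFF13"));
    }
}
```

In an indexing pipeline, the same normalization would have to be applied both to the indexed text and to the query so that either width finds both forms. Whether that happens inside an analyzer or in a separate filter is a design choice this sketch does not address.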


              agnes@atlassian.com Agnes Ro
              jhanji@imahima.com Neeraj Jhanji
              Votes: 0
              Watchers: 3