  1. Confluence Data Center
  2. CONFSERVER-10911

PDF Extractor unable to index Chinese and Japanese characters


Details

    Description

      Confluence's Lucene-based search cannot find Chinese characters (both Traditional and Simplified) in a PDF file.
      The same characters are indexed correctly from a Word DOC file.

      It appears that the Confluence PDF Extractor fails to extract the Chinese characters (see picture). Alphabetic text can be searched without any problem.
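
      One way to confirm the symptom outside Confluence is to check whether text extracted from the PDF still contains the CJK characters present in the original. The sketch below is an illustration only and is not Confluence's extractor: `count_cjk` and `lost_cjk` are hypothetical helper names, and the code point ranges are an assumption that covers only the common CJK Unified Ideographs, Hiragana, and Katakana blocks.

```python
# Hypothetical diagnostic helpers; not part of Confluence or Lucene.
# Assumed ranges: CJK Unified Ideographs, Extension A, Hiragana, Katakana.
CJK_RANGES = [
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
]


def count_cjk(text: str) -> int:
    """Count characters that fall inside the assumed CJK ranges."""
    return sum(
        1
        for ch in text
        if any(lo <= ord(ch) <= hi for lo, hi in CJK_RANGES)
    )


def lost_cjk(expected: str, extracted: str) -> bool:
    """Return True if the source text had CJK characters but the
    extracted text has fewer of them, which suggests that the
    extraction step dropped or garbled them."""
    return count_cjk(extracted) < count_cjk(expected)
```

      Running the same check on text extracted from the DOC and PDF versions of one document would show whether the loss happens during extraction rather than during indexing or querying.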

      Attachments

        1. characters_encoding_test.PNG
          46 kB
        2. chinesechars_pdf_fails.PNG
          46 kB
        3. search_chinese.PNG
          40 kB
        4. test.doc
          62 kB
        5. test.pdf
          33 kB


            People

              akdominguez Katrina Walser (Inactive)
              rhartono Roy Hartono [Atlassian]
              Votes: 4
              Watchers: 5

              Dates

                Created:
                Updated:
                Resolved: